Upstream Bug 3 — Gateway becomes a zombie after system CA rotation; Discord "logged in" READY log line also missing
Repo: github.com/openclaw/openclaw
Suggested labels: bug, gateway, discord, stability, observability
OpenClaw version: 2026.4.5
Node: 22.22.1
OS: macOS 14 (Apple Silicon)
Severity: High — silent outage that the built-in reconnect loop cannot recover from
Title
Long-running gateway becomes permanently unable to connect to Discord after a midnight system CA rotation ("certificate has expired"); the internal reconnect loop cannot recover, and only a full `launchctl kickstart` fixes it. Separately, the `[discord] logged in to discord as …` READY log line no longer fires, making the broken state invisible to watchdogs.
Summary
Two related problems surfaced together during a 2026-04-08 outage:
Problem A — Cached system CAs cause unrecoverable zombie state
The gateway daemon reads the system's root CA store once at process startup (it runs with `NODE_USE_SYSTEM_CA=1` per its launchd env). When the OS keychain rotates an intermediate or root CA while the process is already running, the gateway's cached TLS context retains the old trust anchors forever. All subsequent outbound TLS connections — most visibly to `gateway.discord.gg` and `discord.com` — fail with `Error: certificate has expired`, even though a freshly spawned Node process on the same machine can complete the TLS handshake without issue.
The embedded Discord provider's auto-restart loop (attempts 2/10 → 10/10 with exponential backoff up to 300s) cannot recover from this, because every restart-within-the-same-process reuses the same cached CAs.
Problem B — "logged in" READY log line is missing in 2026.4.5
In prior builds the gateway emitted `[discord] logged in to discord as <id> (<username>)` when the Discord client's `'ready'` event fired. In 2026.4.5 we only see `[discord] client initialized as <id> (<username>); awaiting gateway readiness` and then nothing — even when the bot is actually fully connected, has live presence, and is responding to REST operations. This makes it impossible for watchdogs (or humans grepping the log) to tell whether the bot is healthy or stuck.
Environment
- `openclaw@2026.4.5`
- Node 22.22.1 (`NODE_USE_SYSTEM_CA=1` in launchd env)
- macOS 14 (Apple Silicon)
- LaunchAgent: `ai.openclaw.gateway`
Reproduction — Problem A
- Start the gateway via `launchctl`. Let it run overnight.
- During the overnight period, a root or intermediate CA in the macOS keychain expires (or is rotated).
- The gateway will begin emitting, at ~30s intervals:

  ```
  [discord] gateway metadata lookup failed transiently; using default gateway url (Failed to get gateway information from Discord: fetch failed)
  [discord] channel resolve failed; using config entries. fetch failed | certificate has expired
  ```
- The internal auto-restart loop will cycle attempts 2→10 over ~10 minutes, all failing the same way.
- At ~30 min the Discord provider hits `Max reconnect attempts (50) reached after code 1006` and is disposed. The gateway then enters a permanent suppressed-error state:

  ```
  [discord] suppressed late gateway other error after dispose: Error: certificate has expired
  [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
  ```
- Verify the problem is process-local: in a separate terminal, run

  ```
  node -e "const tls=require('tls'); const s=tls.connect(443,'gateway.discord.gg',{servername:'gateway.discord.gg'},()=>{console.log(s.authorized); s.end()});"
  ```

  → the fresh Node process trusts the cert fine. Only the running gateway daemon is broken.
Actual log evidence (2026-04-08 incident)
```
2026-04-07T18:05:11.650-05:00 [discord] logged in to discord as 1484016201360212069 (Clawbot)   ← last healthy login (old log format)
2026-04-07T21:13:01           gateway process PID 49603 started (launchd)
2026-04-08T00:29:40.940-05:00 [discord] gateway metadata lookup failed transiently; using default gateway url (Failed to get gateway information from Discord: fetch failed)
2026-04-08T00:30:11.218-05:00 [discord] [default] auto-restart attempt 2/10 in 11s
2026-04-08T00:30:22.368-05:00 [discord] channel resolve failed; using config entries. fetch failed | certificate has expired
... attempts 3-10 all fail identically ...
2026-04-08T01:16:57.772-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
2026-04-08T01:22:28.689-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
... "certificate has expired" flood every 30s for next 9 hours ...
2026-04-08T10:12:51.582-05:00 [gateway] ready (5 plugins, 0.7s)   ← after manual `launchctl kickstart -k`
2026-04-08T10:13:00.170-05:00 [discord] client initialized as 1484016201360212069 (Clawbot); awaiting gateway readiness
                              ← no "logged in" line, and that's the LAST [discord] log entry, but the bot is actually READY and presence/REST both work.
```
Reproduction — Problem B
- Restart the gateway cleanly (`launchctl kickstart -k gui/$UID/ai.openclaw.gateway`).
- Tail the log: `tail -f ~/.openclaw/logs/gateway.log | grep '\[discord\]'`.
- Observe: `[discord] client initialized as <id> (<username>); awaiting gateway readiness` fires, then the `[discord]` log channel goes silent.
- Verify the bot actually is READY:
  - `lsof -p <gateway_pid>` shows ESTABLISHED TCP to `162.159.130.234:https` / `162.159.134.234:https` (gateway.discord.gg IPs)
  - REST calls via the bot token succeed: `curl -H "Authorization: Bot $TOKEN" https://discord.com/api/v10/users/@me` → 200
  - A POST to any channel via REST succeeds and shows in Discord
  - Hard-refreshing the Discord client shows the bot as online
- The missing READY log line means you cannot tell from logs alone whether the bot is stuck awaiting IDENTIFY/READY or is actually fine.
Expected behavior
Problem A
The gateway should do at least one of the following:
- Periodically refresh its cached TLS trust store (read system CAs on each new outbound connection, not just at process start), or
- Detect the "certificate has expired" signature in its own error stream and trigger a hard process respawn (not just an internal reconnect attempt), or
- Expose a plugin-level hook so operators can trigger `launchctl kickstart -k` from a watchdog when this signature is detected.
Problem B
When the embedded Discord client fires 'ready', the gateway should log it with identifying detail, at minimum:
`[discord] ready: logged in as <username> (<id>) · <guild_count> guilds · shard <n>/<m>`
Watchdogs and operators rely on this line to distinguish "bot is healthy" from "bot is stuck in IDENTIFY".
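To make the expected line concrete, here is a formatter sketch. The function name and field names are ours; OpenClaw would source these values from its embedded Discord client.

```javascript
// format-ready-line.js — hypothetical formatter for the proposed READY line.
// Function and field names are ours, not OpenClaw's.
function formatReadyLine({ username, id, guildCount, shardId, shardCount }) {
  return `[discord] ready: logged in as ${username} (${id})` +
    ` · ${guildCount} guilds · shard ${shardId}/${shardCount}`;
}

// Example:
// formatReadyLine({ username: 'Clawbot', id: '1484016201360212069',
//                   guildCount: 3, shardId: 0, shardCount: 1 })
// → '[discord] ready: logged in as Clawbot (1484016201360212069) · 3 guilds · shard 0/1'

module.exports = { formatReadyLine };
```

The exact separator and field order matter less than the line being greppable and emitted only on a genuine `'ready'` event.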
Actual behavior
Problem A
- Internal reconnect loop hits its 50-attempt ceiling within ~30 min, then gives up entirely
- The "dispose" state is terminal within the process lifetime
- Operators must manually run `launchctl kickstart -k` to recover
- We discovered this 9 hours after the outage began because there was no externally-visible signal
Problem B
- `[discord] client initialized as …; awaiting gateway readiness` is the terminal log line on a healthy boot
- Log-based health checks are blind to actual bot state
- The `discord-post.mjs` workaround in our repo exists specifically because REST is the only reliable write path
Impact
- 9-hour silent outage on 2026-04-08 (00:29 → 10:12 CT). Bot appeared offline in Discord's member list. All Discord-channel-based agent interactions were dead.
- No alerts fired because the built-in auto-restart loop swallowed the error state and the watchdog had no "healthy" log line to look for.
- Only resolved when a human noticed the bot was gray and manually kicked the gateway.
Suggested fix
For Problem A
- Add a periodic system-CA refresh — either on a timer (hourly?) or per-new-outbound-TLS-session.
- Pattern-match "certificate has expired" in error handlers and escalate to `process.exit(1)` so launchd respawns cleanly (instead of an internal reconnect that inherits the broken state).
- Document the `NODE_USE_SYSTEM_CA=1` caveat — anyone operating this long-running service needs to know that system CA rotations require a full restart.
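The pattern-match-and-escalate idea can be as small as a predicate over the error text. A sketch, with our own names throughout; whether to match on `code` or `message` depends on how OpenClaw surfaces TLS errors:

```javascript
// respawn-on-stale-trust.js — hypothetical escalation guard (names are ours).
// Recognizes the unrecoverable TLS signature so the provider can stop
// retrying in-process and let launchd respawn with a fresh CA store.
const FATAL_TLS = /certificate has expired|CERT_HAS_EXPIRED/i;

function shouldHardRespawn(err) {
  return FATAL_TLS.test(String(err && (err.code || err.message || err)));
}

// In the provider's error handler (sketch, assuming launchd KeepAlive):
// if (shouldHardRespawn(err)) {
//   log.error('[discord] stale system CA store suspected; exiting for respawn');
//   process.exit(1); // launchd restarts us with fresh trust anchors
// }

module.exports = { shouldHardRespawn };
```

A transient `fetch failed` would not match, so the existing in-process backoff still handles ordinary network blips.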
For Problem B
- Restore the "logged in" log line on the Discord client's `'ready'` event — or add a new one in the same spirit.
- Also log identify/resume attempts and their outcomes (`session_id`, `shard_id`, intents ack'd) so the handshake path is observable.
Workaround we are using
- One-shot manual recovery: `launchctl kickstart -k gui/$UID/ai.openclaw.gateway` after noticing the bot is gray.
- REST-only post path (`tools/discord-post.mjs`) that bypasses the gateway entirely for cron jobs — noted in its own header comment that this was added to avoid "session conflicts with the OpenClaw gateway bot, which was causing Clawbot to appear offline every time a cron job posted a message."
- Planned: patching our gateway-watchdog to grep `gateway.err.log` for N+ occurrences of `certificate has expired` in 5 min and force a kickstart.
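The planned watchdog trigger could be sketched as a sliding window over log lines. The class name, thresholds, and the kickstart callback are all ours; the timestamp is injectable so the logic is testable without waiting five minutes:

```javascript
// tls-error-window.js — hypothetical sliding-window trigger for the watchdog.
// Counts "certificate has expired" lines; when `threshold` of them land
// inside `windowMs`, fires the recovery callback.
class TlsErrorWindow {
  constructor({ threshold = 5, windowMs = 5 * 60_000, onTrip }) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.onTrip = onTrip;
    this.hits = [];
  }
  // Feed each new log line; `now` is injectable for testing.
  observe(line, now = Date.now()) {
    if (!line.includes('certificate has expired')) return false;
    this.hits.push(now);
    this.hits = this.hits.filter(t => now - t <= this.windowMs);
    if (this.hits.length >= this.threshold) {
      this.hits = []; // reset so one burst trips the callback only once
      this.onTrip?.();
      return true;
    }
    return false;
  }
}

// Usage sketch: pipe `tail -f gateway.err.log` lines into observe(), with
// onTrip spawning `launchctl kickstart -k gui/$UID/ai.openclaw.gateway`.
module.exports = { TlsErrorWindow };
```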
Related
- `sessions_spawn` modelApplied lying → sessions_spawn returns `modelApplied:true` while actually running a stale resumed model (#63221)
- ~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md