Upstream Bug 3 — Gateway becomes a zombie after system CA rotation; Discord "logged in" READY log line also missing
Repo: github.com/openclaw/openclaw
Suggested labels: bug, gateway, discord, stability, observability
OpenClaw version: 2026.4.5
Node: 22.22.1
OS: macOS 14 (Apple Silicon)
Severity: High — silent outage that the built-in reconnect loop cannot recover from
Title
Long-running gateway becomes permanently unable to connect to Discord after a midnight system CA rotation ("certificate has expired"); the internal reconnect loop cannot recover, and only a full `launchctl kickstart` fixes it. Separately, the `[discord] logged in to discord as …` READY log line no longer fires, making the broken state invisible to watchdogs.
Summary
Two related problems surfaced together during a 2026-04-08 outage:
Problem A — Cached system CAs cause unrecoverable zombie state
The gateway daemon reads the system's root CA store once at process startup (it runs with `NODE_USE_SYSTEM_CA=1` per its launchd env). When the OS keychain rotates an intermediate or root CA while the process is already running, the gateway's cached TLS context retains the old trust anchors forever. All subsequent outbound TLS connections — most visibly to `gateway.discord.gg` and `discord.com` — fail with `Error: certificate has expired`, even though a freshly spawned Node process on the same machine can complete the TLS handshake without issue.
The embedded Discord provider's auto-restart loop (attempts 2/10 → 10/10 with exponential backoff up to 300s) cannot recover from this, because every restart-within-the-same-process reuses the same cached CAs.
Problem B — "logged in" READY log line is missing in 2026.4.5
In prior builds the gateway emitted `[discord] logged in to discord as <id> (<username>)` when the Discord client's `'ready'` event fired. In 2026.4.5 we only see `[discord] client initialized as <id> (<username>); awaiting gateway readiness` and then nothing — even when the bot is actually fully connected, has live presence, and is responding to REST operations. This makes it impossible for watchdogs (or humans grepping the log) to tell whether the bot is healthy or stuck.
Environment
- `openclaw@2026.4.5`
- Node 22.22.1 (`NODE_USE_SYSTEM_CA=1` in launchd env)
- macOS 14 (Apple Silicon)
- LaunchAgent: `ai.openclaw.gateway`
Reproduction — Problem A
- Start the gateway via `launchctl`. Let it run overnight.
- During the overnight period, a root or intermediate CA in the macOS keychain expires (or is rotated).
- The gateway will begin emitting, at ~30s intervals:

  ```
  [discord] gateway metadata lookup failed transiently; using default gateway url (Failed to get gateway information from Discord: fetch failed)
  [discord] channel resolve failed; using config entries. fetch failed | certificate has expired
  ```
- The internal auto-restart loop will cycle attempts 2→10 over ~10 minutes, all failing the same way.
- At ~30 min the Discord provider hits `Max reconnect attempts (50) reached after code 1006` and is disposed. The gateway then enters a permanent suppressed-error state:

  ```
  [discord] suppressed late gateway other error after dispose: Error: certificate has expired
  [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
  ```
- Verify the problem is process-local: in a separate terminal, run

  ```
  node -e "const tls=require('tls'); const s=tls.connect(443,'gateway.discord.gg',{servername:'gateway.discord.gg'},()=>{console.log(s.authorized); s.end()});"
  ```

  → the fresh Node process trusts the cert fine. Only the running gateway daemon is broken.
Actual log evidence (2026-04-08 incident)
```
2026-04-07T18:05:11.650-05:00 [discord] logged in to discord as 1484016201360212069 (Clawbot)   ← last healthy login (old log format)
2026-04-07T21:13:01           gateway process PID 49603 started (launchd)
2026-04-08T00:29:40.940-05:00 [discord] gateway metadata lookup failed transiently; using default gateway url (Failed to get gateway information from Discord: fetch failed)
2026-04-08T00:30:11.218-05:00 [discord] [default] auto-restart attempt 2/10 in 11s
2026-04-08T00:30:22.368-05:00 [discord] channel resolve failed; using config entries. fetch failed | certificate has expired
... attempts 3-10 all fail identically ...
2026-04-08T01:16:57.772-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
2026-04-08T01:22:28.689-05:00 [discord] suppressed late gateway reconnect-exhausted error after dispose: Error: Max reconnect attempts (50) reached after code 1006
... "certificate has expired" flood every 30s for next 9 hours ...
2026-04-08T10:12:51.582-05:00 [gateway] ready (5 plugins, 0.7s)   ← after manual `launchctl kickstart -k`
2026-04-08T10:13:00.170-05:00 [discord] client initialized as 1484016201360212069 (Clawbot); awaiting gateway readiness
                              ← no "logged in" line, and that's the LAST [discord] log entry, but the bot is actually READY and presence/REST both work.
```
Reproduction — Problem B
- Restart the gateway cleanly (`launchctl kickstart -k gui/$UID/ai.openclaw.gateway`).
- Tail the log: `tail -f ~/.openclaw/logs/gateway.log | grep '\[discord\]'`.
- Observe: `[discord] client initialized as <id> (<username>); awaiting gateway readiness` fires, then the `[discord]` log channel goes silent.
- Verify the bot actually is READY:
  - `lsof -p <gateway_pid>` shows ESTABLISHED TCP to `162.159.130.234:https` / `162.159.134.234:https` (gateway.discord.gg IPs)
  - REST calls via the bot token succeed: `curl -H "Authorization: Bot $TOKEN" https://discord.com/api/v10/users/@me` → 200
  - A POST to any channel via REST succeeds and shows in Discord
  - Hard-refreshing the Discord client shows the bot as online
- The missing READY log line means you cannot tell from logs alone whether the bot is stuck awaiting IDENTIFY/READY or is actually fine.
Expected behavior
Problem A
The gateway should do at least one of the following:
- Periodically refresh its cached TLS trust store (read system CAs on each new outbound connection, not just at process start), or
- Detect the "certificate has expired" signature in its own error stream and trigger a hard process respawn (not just an internal reconnect attempt), or
- Expose a plugin-level hook so operators can trigger `launchctl kickstart -k` from a watchdog when this signature is detected.
Problem B
When the embedded Discord client fires 'ready', the gateway should log it with identifying detail, at minimum:
`[discord] ready: logged in as <username> (<id>) · <guild_count> guilds · shard <n>/<m>`
Watchdogs and operators rely on this line to distinguish "bot is healthy" from "bot is stuck in IDENTIFY".
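To make the expected line concrete, here is a formatter sketch. The function name and field names are ours; OpenClaw would source these values from its embedded Discord client.

```javascript
// format-ready-line.js — hypothetical formatter for the proposed READY line.
// Function and field names are ours, not OpenClaw's.
function formatReadyLine({ username, id, guildCount, shardId, shardCount }) {
  return `[discord] ready: logged in as ${username} (${id})` +
    ` · ${guildCount} guilds · shard ${shardId}/${shardCount}`;
}

// Example:
// formatReadyLine({ username: 'Clawbot', id: '1484016201360212069',
//                   guildCount: 3, shardId: 0, shardCount: 1 })
// → '[discord] ready: logged in as Clawbot (1484016201360212069) · 3 guilds · shard 0/1'

module.exports = { formatReadyLine };
```

The exact separator and field order matter less than the line being greppable and emitted only on a genuine `'ready'` event.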
Actual behavior
Problem A
- Internal reconnect loop hits its 50-attempt ceiling within ~30 min, then gives up entirely
- The "dispose" state is terminal within the process lifetime
- Operators must manually run `launchctl kickstart -k` to recover
- We discovered this 9 hours after the outage began because there was no externally-visible signal
Problem B
- `[discord] client initialized as …; awaiting gateway readiness` is the terminal log line on a healthy boot
- Log-based health checks are blind to actual bot state
- The `discord-post.mjs` workaround in our repo exists specifically because REST is the only reliable write path
Impact
- 9-hour silent outage on 2026-04-08 (00:29 → 10:12 CT). Bot appeared offline in Discord's member list. All Discord-channel-based agent interactions were dead.
- No alerts fired because the built-in auto-restart loop swallowed the error state and the watchdog had no "healthy" log line to look for.
- Only resolved when a human noticed the bot was gray and manually kicked the gateway.
Suggested fix
For Problem A
- Add a periodic system-CA refresh — either on a timer (hourly?) or per-new-outbound-TLS-session.
- Pattern-match "certificate has expired" in error handlers and escalate to `process.exit(1)` so launchd respawns cleanly (instead of an internal reconnect that inherits the broken state).
- Document the `NODE_USE_SYSTEM_CA=1` caveat — anyone operating this long-running service needs to know that system CA rotations require a full restart.
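The pattern-match-and-escalate idea can be as small as a predicate over the error text. A sketch, with our own names throughout; whether to match on `code` or `message` depends on how OpenClaw surfaces TLS errors:

```javascript
// respawn-on-stale-trust.js — hypothetical escalation guard (names are ours).
// Recognizes the unrecoverable TLS signature so the provider can stop
// retrying in-process and let launchd respawn with a fresh CA store.
const FATAL_TLS = /certificate has expired|CERT_HAS_EXPIRED/i;

function shouldHardRespawn(err) {
  return FATAL_TLS.test(String(err && (err.code || err.message || err)));
}

// In the provider's error handler (sketch, assuming launchd KeepAlive):
// if (shouldHardRespawn(err)) {
//   log.error('[discord] stale system CA store suspected; exiting for respawn');
//   process.exit(1); // launchd restarts us with fresh trust anchors
// }

module.exports = { shouldHardRespawn };
```

A transient `fetch failed` would not match, so the existing in-process backoff still handles ordinary network blips.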
For Problem B
- Restore the "logged in" log line on the Discord client's `'ready'` event — or add a new one in the same spirit.
- Also log identify/resume attempts and their outcomes (`session_id`, `shard_id`, intents ack'd) so the handshake path is observable.
Workaround we are using
- One-shot manual recovery: `launchctl kickstart -k gui/$UID/ai.openclaw.gateway` after noticing the bot is gray.
- REST-only post path (`tools/discord-post.mjs`) that bypasses the gateway entirely for cron jobs — noted in its own header comment that this was added to avoid "session conflicts with the OpenClaw gateway bot, which was causing Clawbot to appear offline every time a cron job posted a message."
- Planned: patching our gateway-watchdog to grep `gateway.err.log` for N+ occurrences of `certificate has expired` in 5 min and force a kickstart.
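The planned watchdog trigger could be sketched as a sliding window over log lines. The class name, thresholds, and the kickstart callback are all ours; the timestamp is injectable so the logic is testable without waiting five minutes:

```javascript
// tls-error-window.js — hypothetical sliding-window trigger for the watchdog.
// Counts "certificate has expired" lines; when `threshold` of them land
// inside `windowMs`, fires the recovery callback.
class TlsErrorWindow {
  constructor({ threshold = 5, windowMs = 5 * 60_000, onTrip }) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.onTrip = onTrip;
    this.hits = [];
  }
  // Feed each new log line; `now` is injectable for testing.
  observe(line, now = Date.now()) {
    if (!line.includes('certificate has expired')) return false;
    this.hits.push(now);
    this.hits = this.hits.filter(t => now - t <= this.windowMs);
    if (this.hits.length >= this.threshold) {
      this.hits = []; // reset so one burst trips the callback only once
      this.onTrip?.();
      return true;
    }
    return false;
  }
}

// Usage sketch: pipe `tail -f gateway.err.log` lines into observe(), with
// onTrip spawning `launchctl kickstart -k gui/$UID/ai.openclaw.gateway`.
module.exports = { TlsErrorWindow };
```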
Related
- `sessions_spawn` modelApplied lying → sessions_spawn returns `modelApplied:true` while actually running a stale resumed model (#63221)
- ~/.openclaw/workspace/output/post-restart-fallback-cascade-incident-report.md