Fix Discord gateway zombie state after failed auto-retry #1374
Merged
Aaronontheweb merged 1 commit on Jun 9, 2026
Root cause of the 0.24.0-beta.2 reliability incident (testlab Seq event
event-8f659382c67008de26f7351717000000): the auto-retry path connects
with ActorRefs.Nobody as the reply-to, and StartConnecting stored it
verbatim — so every `_pendingConnectReplyTo is null` check misfired.
When a retried connect timed out waiting for READY, HandleReadyTimedOut
took the caller-driven branch (Nobody passed the not-null check), failed
the "pending connect" by telling Nobody (a silent no-op), and parked the
actor in CleanReconnectRequired — a state that swallows Connected/READY
events, handles no RetryConnect, and arms no timers. Nothing could ever
leave it: when Discord.Net later self-reconnected, the actor dropped
every inbound message ("Dropping Discord message ... while gateway is
not ready") until the daemon was restarted. The same skew made
`isRetry = _pendingConnectReplyTo is null` compute false on retries, so
ConnectionRestored never published after an auto-retry recovery — in
both the Discord and Mattermost lifecycle actors.
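The failure mode above can be modelled in a few lines. This is a minimal sketch in Python rather than the actors' own C#, and it is not the project's code: the class `UnfixedLifecycle`, the `"Reconnecting"` state, and the counter are hypothetical stand-ins, and `NOBODY` only imitates the "accepts and drops every message" behaviour of `ActorRefs.Nobody`.

```python
class _Nobody:
    """Stand-in for a no-op recipient: accepts messages and drops them."""

    def tell(self, message):
        pass  # silent no-op


NOBODY = _Nobody()


class UnfixedLifecycle:
    """Hypothetical, heavily simplified model of the unfixed lifecycle actor."""

    def __init__(self):
        self.state = "Disconnected"
        self.pending_reply_to = None
        self.clean_reconnects_requested = 0

    def start_connecting(self, reply_to):
        # Stored verbatim: NOBODY is a real object, so it is never None.
        self.pending_reply_to = reply_to
        self.state = "Connecting"

    def on_ready_timed_out(self):
        if self.pending_reply_to is not None:
            # Caller-driven branch: "fail" the pending connect, then park.
            self.pending_reply_to.tell("connect failed")
            self.pending_reply_to = None
            self.state = "CleanReconnectRequired"
        else:
            # Self-healing branch: request a clean reconnect.
            self.clean_reconnects_requested += 1
            self.state = "Reconnecting"

    def on_ready(self):
        if self.state == "CleanReconnectRequired":
            return  # swallowed: nothing leaves this state
        self.state = "Ready"


actor = UnfixedLifecycle()
actor.start_connecting(NOBODY)  # the auto-retry path
actor.on_ready_timed_out()      # takes the caller-driven branch by mistake
actor.on_ready()                # a later READY is dropped
```

After this sequence the model is still in `CleanReconnectRequired` and has never requested a clean reconnect, which is the zombie state described above.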
Fix: normalize Nobody to null at the single storage point in
StartConnecting, in both actors. "No pending caller" now has exactly one
representation, so the ready-timeout path falls through to
RequestCleanReconnect (publish → clean stop → scheduled retry — the
self-healing loop), and ConnectionRestored fires on retry recoveries.
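The normalization idea can be sketched as follows. This is an illustrative Python model, not the C# change itself: `normalize_reply_to` and `ConnectTracker` are hypothetical names, `NOBODY` stands in for `ActorRefs.Nobody`, and the assumption that a caller-driven connect publishes no `ConnectionRestored` is mine.

```python
NOBODY = object()  # stand-in for ActorRefs.Nobody


def normalize_reply_to(reply_to):
    """Collapse the 'nobody' sentinel to None so absence has one representation."""
    return None if reply_to is NOBODY else reply_to


class ConnectTracker:
    """Hypothetical model of the single storage point for the pending caller."""

    def __init__(self):
        self.pending_reply_to = None
        self.is_retry = False

    def start_connecting(self, reply_to):
        # Normalize once, here, so every later None check can be trusted.
        self.pending_reply_to = normalize_reply_to(reply_to)
        # Mirrors `isRetry = _pendingConnectReplyTo is null`.
        self.is_retry = self.pending_reply_to is None

    def on_ready(self):
        """Return the events to publish when READY arrives."""
        return ["ConnectionRestored"] if self.is_retry else []


tracker = ConnectTracker()

tracker.start_connecting(NOBODY)    # auto-retry path
retry_events = tracker.on_ready()   # retry recovery publishes the event

tracker.start_connecting(object())  # caller-driven connect
caller_events = tracker.on_ready()  # assumed: no restore event for callers
```

Normalizing at the storage point, rather than teaching every check about the sentinel, keeps the number of places that can get it wrong at one.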
Regression tests, both mutation-verified (fail on the unfixed actors):
- Discord: virtual-time test driving the exact incident sequence —
ready → spurious-Connected clean reconnect → auto-retry → READY
timeout → must emit a second clean-reconnect request and recover when
READY finally arrives, publishing ConnectionRestored
- Mattermost: auto-retry recovery must publish ConnectionRestored
Full Actors suite 2,282 green; slopwatch clean.
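For readers unfamiliar with the testing approach, here is a rough Python sketch of a virtual-time regression test that is also checked against the "mutant" (unfixed) behaviour. It is not the project's test: `VirtualScheduler`, `Lifecycle`, the `CleanReconnectRequested` event name, and both timing constants are hypothetical, and the spurious Connected event is reduced to a direct call to `request_clean_reconnect`.

```python
import heapq

NOBODY = object()     # stand-in for ActorRefs.Nobody
READY_TIMEOUT = 30.0  # hypothetical value
RETRY_DELAY = 5.0     # hypothetical value


class VirtualScheduler:
    """Tiny virtual clock: timers fire only when advance() is called."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds):
        deadline = self.now + seconds
        while self._queue and self._queue[0][0] <= deadline:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()
        self.now = deadline


class Lifecycle:
    """Simplified model; normalize_nobody=False reproduces the unfixed actor."""

    def __init__(self, scheduler, normalize_nobody):
        self.scheduler = scheduler
        self.normalize_nobody = normalize_nobody
        self.state = "Disconnected"
        self.pending = None
        self.is_retry = False
        self.attempt = 0
        self.published = []

    def start_connecting(self, reply_to):
        if self.normalize_nobody and reply_to is NOBODY:
            reply_to = None
        self.pending = reply_to
        self.is_retry = reply_to is None
        self.state = "Connecting"
        self.attempt += 1
        attempt = self.attempt
        self.scheduler.schedule(
            READY_TIMEOUT, lambda: self.on_ready_timed_out(attempt))

    def on_ready_timed_out(self, attempt):
        if self.state != "Connecting" or attempt != self.attempt:
            return  # stale timer from an earlier attempt
        if self.pending is not None:
            self.pending = None
            self.state = "CleanReconnectRequired"  # parked, no timers armed
        else:
            self.request_clean_reconnect()

    def request_clean_reconnect(self):
        self.published.append("CleanReconnectRequested")
        self.state = "Stopped"
        self.scheduler.schedule(
            RETRY_DELAY, lambda: self.start_connecting(NOBODY))

    def on_ready(self):
        if self.state != "Connecting":
            return  # swallowed outside Connecting
        self.state = "Ready"
        if self.is_retry:
            self.published.append("ConnectionRestored")


def run_incident(normalize_nobody):
    scheduler = VirtualScheduler()
    actor = Lifecycle(scheduler, normalize_nobody)
    actor.start_connecting("caller")  # initial caller-driven connect
    actor.on_ready()                  # ready
    actor.request_clean_reconnect()   # spurious Connected forces a reconnect
    scheduler.advance(RETRY_DELAY)    # auto-retry fires with NOBODY
    scheduler.advance(READY_TIMEOUT)  # READY does not arrive in time
    scheduler.advance(RETRY_DELAY)    # second retry (fixed variant only)
    actor.on_ready()                  # READY finally arrives
    return actor


fixed = run_incident(normalize_nobody=True)
mutant = run_incident(normalize_nobody=False)
```

In this model the fixed variant ends in `Ready` after two clean-reconnect requests and one `ConnectionRestored`, while the mutant stays parked in `CleanReconnectRequired`, so an assertion on the fixed outcome fails when the normalization is reverted.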
5817505 to
7faee97
Aaronontheweb added a commit that referenced this pull request on Jun 10, 2026
…cycle reliability (SPEC-015) (#1375) Implements SPEC-015 Phases 2-7 (contract test bases, IGatewaySnapshot, generic lifecycle/gateway/conversation actor bases, RemoteChatChannelBuilder, IChannelOutboundClient send consolidation), adds blank-query destination discovery to lookup_channel_destination, and fixes every confirmed finding from a max-effort review — including the two remaining gateway-zombie paths beyond the #1374 hotfix (CleanReconnectRequired deleted; one canonical fail path), the Healthy-but-deaf connect-timeout window, clean-reconnect backoff, multi-channel thread-history fetcher cross-wiring, Mattermost proactive DM ACL, and registry-validated outbound clients. OpenSpec deltas and the adding-a-channel runbook updated to the as-built design. Full solution suite green; every behavioral fix has a regression test that fails when reverted.
Summary
Root-causes and fixes the Discord reliability incident observed in testlab on 0.24.0-beta.2: the gateway dropped every inbound Discord message for 30+ minutes ("Dropping Discord message … while gateway is not ready: Discord gateway disconnected.") while the WebSocket was demonstrably connected and delivering events.
Root cause
The auto-retry path connects with `ActorRefs.Nobody` as the reply-to, and `StartConnecting` stored it verbatim, so every `_pendingConnectReplyTo is null` check misfired:
- `Attempting Discord reconnect (delay was 00:00:05)` is visible in Seq at 21:15:28.
- `HandleReadyTimedOut` saw `Nobody` pass the `is not null` pending-caller check, "failed" the pending connect by telling `Nobody` (a silent no-op), and parked the actor in `CleanReconnectRequired`, silently (matching the Seq silence after 21:15:58).
- `CleanReconnectRequired` swallows Connected/READY events, handles no `RetryConnect`, and arms no timers. When Discord.Net self-reconnected at 21:45, the actor never noticed: a permanent zombie until daemon restart.
The same `Nobody`-vs-`null` skew made `isRetry` compute false on retries, so `ConnectionRestored` never published after an auto-retry recovery, in both the Discord and Mattermost lifecycle actors.
Fix
Normalize `Nobody` → `null` at the single storage point in `StartConnecting` (both actors). "No pending caller" now has exactly one representation: the ready-timeout path falls through to `RequestCleanReconnect` (publish → clean stop → scheduled retry, the self-healing loop), and `ConnectionRestored` fires on retry recoveries.
Tests (mutation-verified: both fail against the unfixed actors)
- Discord (`DiscordGatewayLifecycleRetryTimeoutTests`, virtual-time via `TestScheduler`): drives the exact incident sequence (ready → spurious-Connected clean reconnect → auto-retry → READY timeout), then asserts that a second clean-reconnect request is emitted and that the actor recovers when READY finally arrives, publishing `ConnectionRestored`.
- Mattermost: auto-retry recovery must publish `ConnectionRestored` exactly once.
Validation
- `Netclaw.Actors.Tests` suite: 2,282 passed
- `dotnet slopwatch analyze`: clean; headers verified
Recommend cutting 0.24.0-beta.3 once merged: beta.2's Discord channel cannot recover from this state without a daemon restart.