Fix Discord gateway zombie state after failed auto-retry by Aaronontheweb · Pull Request #1374 · netclaw-dev/netclaw

Aaronontheweb · 2026-06-09T22:08:14Z

Summary

Root-causes and fixes the Discord reliability incident observed in testlab on 0.24.0-beta.2: the gateway dropped every inbound Discord message for 30+ minutes ("Dropping Discord message … while gateway is not ready: Discord gateway disconnected.") while the WebSocket was demonstrably connected and delivering events.

Root cause

The auto-retry path connects with ActorRefs.Nobody as the reply-to, and StartConnecting stored it verbatim, so every `_pendingConnectReplyTo is null` check misfired:

1. A transport drop while Ready put the actor in Connecting, awaiting Discord.Net's self-reconnect; the 30s READY timeout fired and correctly drove a clean reconnect + auto-retry ("Attempting Discord reconnect (delay was 00:00:05)", visible in Seq at 21:15:28).
2. The retried connect also failed to reach READY in 30s. HandleReadyTimedOut saw Nobody pass the `is not null` pending-caller check, "failed" the pending connect by telling Nobody (a silent no-op), and silently parked the actor in CleanReconnectRequired (matching the Seq silence after 21:15:58).
3. CleanReconnectRequired swallows Connected/READY events, handles no RetryConnect, and arms no timers. When Discord.Net self-reconnected at 21:45, the actor never noticed: a permanent zombie until daemon restart.

The same Nobody-vs-null skew made isRetry compute false on retries, so ConnectionRestored never published after an auto-retry recovery — in both the Discord and Mattermost lifecycle actors.
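The failure mode can be reproduced in a few lines. The following is a minimal, language-neutral sketch in Python, not the actual C# actor: the class `BuggyLifecycle`, the state names `Disconnected` and `Reconnecting`, and the `ConnectFailed` message are hypothetical stand-ins, and only the Nobody-vs-null check mirrors the behavior described in this PR.

```python
class Ref:
    """Stand-in for an actor reference; records what it is told."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def tell(self, msg):
        self.inbox.append(msg)


class _Nobody(Ref):
    """Sentinel that discards messages, modeled on ActorRefs.Nobody."""
    def tell(self, msg):
        pass  # silent no-op


NOBODY = _Nobody("nobody")


class BuggyLifecycle:
    """Hypothetical model of the unfixed lifecycle actor."""
    def __init__(self):
        self.state = "Disconnected"
        self.pending_reply_to = None
        self.clean_reconnects = 0

    def start_connecting(self, reply_to):
        self.pending_reply_to = reply_to  # bug: NOBODY is stored verbatim
        self.state = "Connecting"

    def handle_ready_timed_out(self):
        if self.pending_reply_to is not None:
            # Caller-driven branch: NOBODY is "not None", so it lands here.
            self.pending_reply_to.tell("ConnectFailed")
            self.pending_reply_to = None
            self.state = "CleanReconnectRequired"
        else:
            # Self-healing branch: never reached on the auto-retry path.
            self.clean_reconnects += 1
            self.state = "Reconnecting"

    def handle_ready(self):
        if self.state == "CleanReconnectRequired":
            return  # READY is swallowed in the parked state
        self.state = "Ready"


actor = BuggyLifecycle()
actor.start_connecting(NOBODY)   # auto-retry connects with the sentinel
actor.handle_ready_timed_out()   # takes the caller-driven branch
actor.handle_ready()             # a later self-reconnect goes unnoticed
print(actor.state, actor.clean_reconnects)
```

In this model the actor ends in `CleanReconnectRequired` with zero clean reconnects requested, which is the zombie state from the incident.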

Fix

Normalize Nobody → null at the single storage point in StartConnecting (both actors). "No pending caller" now has exactly one representation: the ready-timeout path falls through to RequestCleanReconnect (publish → clean stop → scheduled retry — the self-healing loop), and ConnectionRestored fires on retry recoveries.
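As an illustration of the normalization, here is a minimal Python sketch under stated assumptions. It is not the C# change itself: `normalize_reply_to`, the `Lifecycle` class, the `AwaitingRetry` state, and the `CleanReconnectRequested` event name are hypothetical, and computing `is_retry` inside `start_connecting` is an assumption about where the real `isRetry` check lives.

```python
class _Nobody:
    """Sentinel that discards messages, modeled on ActorRefs.Nobody."""
    def tell(self, msg):
        pass


NOBODY = _Nobody()


def normalize_reply_to(reply_to):
    """Map the sentinel to None so "no pending caller" has one representation."""
    return None if reply_to is NOBODY else reply_to


class Lifecycle:
    """Hypothetical model of the fixed lifecycle actor."""
    def __init__(self):
        self.state = "Disconnected"
        self.pending_reply_to = None
        self.is_retry = False
        self.published = []

    def start_connecting(self, reply_to):
        # Single storage point: normalize before storing.
        self.pending_reply_to = normalize_reply_to(reply_to)
        self.is_retry = self.pending_reply_to is None
        self.state = "Connecting"

    def handle_ready_timed_out(self):
        if self.pending_reply_to is not None:
            # A real caller is waiting: report the failure to it.
            self.pending_reply_to.tell("ConnectFailed")
            self.pending_reply_to = None
            self.state = "CleanReconnectRequired"
        else:
            # No caller: publish, then wait for the scheduled retry.
            self.published.append("CleanReconnectRequested")
            self.state = "AwaitingRetry"

    def handle_ready(self):
        if self.state != "Connecting":
            return
        if self.is_retry:
            self.published.append("ConnectionRestored")
        self.state = "Ready"


actor = Lifecycle()
actor.start_connecting(NOBODY)   # auto-retry, stored as None
actor.handle_ready_timed_out()   # falls through to the self-healing branch
actor.start_connecting(NOBODY)   # the scheduled retry
actor.handle_ready()             # READY finally arrives
print(actor.state, actor.published)
```

Normalizing at the storage point, rather than patching each check, keeps every later `is None` test correct without further changes.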

Tests (mutation-verified: both fail against the unfixed actors)

- Discord (DiscordGatewayLifecycleRetryTimeoutTests, virtual-time via TestScheduler): drives the exact incident sequence (ready → spurious-Connected clean reconnect → auto-retry → READY timeout), then asserts that a second clean-reconnect request is emitted and that the actor recovers when READY finally arrives, publishing ConnectionRestored.
- Mattermost: auto-retry recovery must publish ConnectionRestored exactly once.
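To illustrate what "virtual-time" means for tests like these, the sketch below shows a hand-rolled manual clock in Python. It is not Akka.NET's TestScheduler and does not reproduce its API; `VirtualScheduler`, `schedule_once`, and `advance` are hypothetical names. The 30s timeout and 5s retry delay are taken from the incident timeline above.

```python
import heapq


class VirtualScheduler:
    """Hypothetical manual clock: timers fire only when time is advanced."""
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule_once(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds):
        deadline = self.now + seconds
        # Fire every timer due by the deadline, including ones that
        # callbacks schedule while the clock is being advanced.
        while self._queue and self._queue[0][0] <= deadline:
            due, _, callback = heapq.heappop(self._queue)
            self.now = due
            callback()
        self.now = deadline


events = []
sched = VirtualScheduler()


def on_ready_timeout():
    events.append("ReadyTimedOut")
    # Schedule the retry after the 5s delay.
    sched.schedule_once(5, lambda: events.append("RetryConnect"))


sched.schedule_once(30, on_ready_timeout)  # 30s READY timeout
sched.advance(29)  # nothing is due yet
sched.advance(1)   # timeout fires at t=30
sched.advance(5)   # retry fires at t=35
print(events)
```

Because no real time passes, a sequence that took 30+ minutes in the incident can be driven deterministically in a test.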

Validation

Full Netclaw.Actors.Tests suite: 2,282 passed
dotnet slopwatch analyze: clean; headers verified

Recommend cutting 0.24.0-beta.3 once merged — beta.2's Discord channel cannot recover from this state without a daemon restart.

Aaronontheweb

LGTM

Root cause of the 0.24.0-beta.2 reliability incident (testlab Seq event event-8f659382c67008de26f7351717000000): the auto-retry path connects with ActorRefs.Nobody as the reply-to, and StartConnecting stored it verbatim, so every `_pendingConnectReplyTo is null` check misfired.

When a retried connect timed out waiting for READY, HandleReadyTimedOut took the caller-driven branch (Nobody passed the not-null check), failed the "pending connect" by telling Nobody (a silent no-op), and parked the actor in CleanReconnectRequired, a state that swallows Connected/READY events, handles no RetryConnect, and arms no timers. Nothing could ever leave it: when Discord.Net later self-reconnected, the actor dropped every inbound message ("Dropping Discord message ... while gateway is not ready") until the daemon was restarted.

The same skew made `isRetry = _pendingConnectReplyTo is null` compute false on retries, so ConnectionRestored never published after an auto-retry recovery, in both the Discord and Mattermost lifecycle actors.

Fix: normalize Nobody to null at the single storage point in StartConnecting, in both actors. "No pending caller" now has exactly one representation, so the ready-timeout path falls through to RequestCleanReconnect (publish → clean stop → scheduled retry, the self-healing loop), and ConnectionRestored fires on retry recoveries.

Regression tests, both mutation-verified (fail on the unfixed actors):

- Discord: virtual-time test driving the exact incident sequence (ready → spurious-Connected clean reconnect → auto-retry → READY timeout); must emit a second clean-reconnect request and recover when READY finally arrives, publishing ConnectionRestored
- Mattermost: auto-retry recovery must publish ConnectionRestored

Full Actors suite 2,282 green; slopwatch clean.

…cycle reliability (SPEC-015) (#1375) Implements SPEC-015 Phases 2-7 (contract test bases, IGatewaySnapshot, generic lifecycle/gateway/conversation actor bases, RemoteChatChannelBuilder, IChannelOutboundClient send consolidation), adds blank-query destination discovery to lookup_channel_destination, and fixes every confirmed finding from a max-effort review — including the two remaining gateway-zombie paths beyond the #1374 hotfix (CleanReconnectRequired deleted; one canonical fail path), the Healthy-but-deaf connect-timeout window, clean-reconnect backoff, multi-channel thread-history fetcher cross-wiring, Mattermost proactive DM ACL, and registry-validated outbound clients. OpenSpec deltas and the adding-a-channel runbook updated to the as-built design. Full solution suite green; every behavioral fix has a regression test that fails when reverted.

Aaronontheweb added the bug and channels labels Jun 9, 2026

Aaronontheweb commented Jun 9, 2026

Aaronontheweb enabled auto-merge (squash) June 9, 2026 22:12

Aaronontheweb force-pushed the fix/discord-lifecycle-retry-zombie branch from 5817505 to 7faee97 Compare June 9, 2026 22:19

Aaronontheweb merged commit 2ef3d9d into netclaw-dev:dev Jun 9, 2026
15 checks passed

Aaronontheweb mentioned this pull request Jun 10, 2026

Standardize channel infrastructure, add destination listing, fix lifecycle reliability (SPEC-015) #1375

Merged

Aaronontheweb merged 1 commit into netclaw-dev:dev from Aaronontheweb:fix/discord-lifecycle-retry-zombie
