
Fix Discord gateway zombie state after failed auto-retry #1374

Merged
Aaronontheweb merged 1 commit into netclaw-dev:dev from Aaronontheweb:fix/discord-lifecycle-retry-zombie
Jun 9, 2026

@Aaronontheweb Aaronontheweb commented Jun 9, 2026

Collaborator

Summary

Root-causes and fixes the Discord reliability incident observed in testlab on 0.24.0-beta.2: the gateway dropped every inbound Discord message for 30+ minutes ("Dropping Discord message … while gateway is not ready: Discord gateway disconnected.") while the WebSocket was demonstrably connected and delivering events.

Root cause

The auto-retry path connects with ActorRefs.Nobody as the reply-to, and StartConnecting stored it verbatim, so every `_pendingConnectReplyTo is null` check misfired:

  1. A transport drop while Ready put the actor in Connecting, awaiting Discord.Net's self-reconnect; the 30s READY timeout fired and correctly drove a clean reconnect + auto-retry ("Attempting Discord reconnect (delay was 00:00:05)", visible in Seq at 21:15:28).
  2. The retried connect also failed to reach READY in 30s. HandleReadyTimedOut saw Nobody pass the `is not null` pending-caller check, "failed" the pending connect by telling Nobody (a silent no-op), and parked the actor in CleanReconnectRequired silently (matching the Seq silence after 21:15:58).
  3. CleanReconnectRequired swallows Connected/READY events, handles no RetryConnect, and arms no timers. When Discord.Net self-reconnected at 21:45, the actor never noticed — permanent zombie until daemon restart.

The same Nobody-vs-null skew made isRetry compute false on retries, so ConnectionRestored never published after an auto-retry recovery — in both the Discord and Mattermost lifecycle actors.
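
The failure mode above can be reproduced in a few lines. The following is a minimal, language-neutral model in Python, not the actual Akka.NET actor: `NOBODY`, `BuggyLifecycle`, and the `RETRY_SCHEDULED` state are hypothetical stand-ins chosen for illustration, and only the branch structure is taken from the description above.

```python
from enum import Enum, auto


class _Nobody:
    """Stand-in for a 'no recipient' reference: messages sent to it vanish."""

    def tell(self, message):
        pass  # silent no-op


NOBODY = _Nobody()  # hypothetical analogue of ActorRefs.Nobody


class State(Enum):
    CONNECTING = auto()
    CLEAN_RECONNECT_REQUIRED = auto()  # in this model nothing ever leaves it
    RETRY_SCHEDULED = auto()           # hypothetical name for the self-healing path


class BuggyLifecycle:
    def __init__(self):
        self.state = State.CONNECTING
        self.pending_reply_to = None

    def start_connecting(self, reply_to):
        # The bug being illustrated: the sentinel is stored verbatim.
        self.pending_reply_to = reply_to
        self.state = State.CONNECTING

    def handle_ready_timed_out(self):
        if self.pending_reply_to is not None:
            # Caller-driven branch: report the failure and wait for the caller.
            self.pending_reply_to.tell("connect failed: READY timed out")
            self.state = State.CLEAN_RECONNECT_REQUIRED
        else:
            # Self-healing branch: schedule another attempt.
            self.state = State.RETRY_SCHEDULED


actor = BuggyLifecycle()
actor.start_connecting(NOBODY)  # what the auto-retry path passes
actor.handle_ready_timed_out()
print(actor.state.name)  # CLEAN_RECONNECT_REQUIRED
```

Because the sentinel is an object rather than `None`, the timeout handler believes a real caller is waiting, notifies nobody, and parks in the state that has no exit.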

Fix

Normalize Nobody → null at the single storage point in StartConnecting (both actors). "No pending caller" now has exactly one representation: the ready-timeout path falls through to RequestCleanReconnect (publish → clean stop → scheduled retry, the self-healing loop), and ConnectionRestored fires on retry recoveries.
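
As a rough illustration of normalizing at the storage point, here is a small Python model (again a sketch, not the real actors). The `Lifecycle` class, its string state names, and the `CleanReconnectRequested` event name are assumptions made for this example; only `ConnectionRestored` and the branch logic come from the description.

```python
class _Nobody:
    """Stand-in for a 'no recipient' reference: messages sent to it vanish."""

    def tell(self, message):
        pass


NOBODY = _Nobody()  # hypothetical analogue of ActorRefs.Nobody


class Lifecycle:
    def __init__(self):
        self.state = "Connecting"
        self.pending_reply_to = None
        self.published = []  # events the actor would publish, in order

    def start_connecting(self, reply_to):
        # The fix: collapse the sentinel to None at the one place it is stored,
        # so every later "is there a pending caller?" check agrees.
        self.pending_reply_to = None if reply_to is NOBODY else reply_to
        self.state = "Connecting"

    def handle_ready_timed_out(self):
        if self.pending_reply_to is not None:
            # A real caller is waiting: fail their request.
            self.pending_reply_to.tell("connect failed: READY timed out")
            self.pending_reply_to = None
            self.state = "ConnectFailed"
        else:
            # No caller: publish, stop cleanly, and schedule a retry.
            self.published.append("CleanReconnectRequested")
            self.state = "RetryScheduled"

    def retry_connect(self):
        self.start_connecting(NOBODY)

    def handle_ready(self):
        is_retry = self.pending_reply_to is None
        self.state = "Ready"
        if is_retry:
            self.published.append("ConnectionRestored")


actor = Lifecycle()
actor.start_connecting(NOBODY)   # auto-retry attempt
actor.handle_ready_timed_out()   # falls through to the self-healing branch
actor.retry_connect()            # scheduled retry fires
actor.handle_ready()             # READY finally arrives
print(actor.published)  # ['CleanReconnectRequested', 'ConnectionRestored']
```

Normalizing once on the way in, rather than teaching every check about the sentinel, keeps the two downstream decisions (timeout branch and `is_retry`) from drifting apart again.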

Tests (mutation-verified: both fail against the unfixed actors)

  • Discord (DiscordGatewayLifecycleRetryTimeoutTests, virtual-time via TestScheduler): drives the exact incident sequence — ready → spurious-Connected clean reconnect → auto-retry → READY timeout → asserts a second clean-reconnect request is emitted and the actor recovers when READY finally arrives, publishing ConnectionRestored.
  • Mattermost: auto-retry recovery must publish ConnectionRestored exactly once.
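
For readers unfamiliar with virtual-time testing, the idea behind the Discord test can be sketched as follows. This is a simplified Python model under stated assumptions, not the real test or Akka.NET's TestScheduler: `VirtualScheduler`, `Lifecycle`, and the `CleanReconnectRequested` event name are hypothetical, and the 30s/5s values are the ones quoted in the incident timeline.

```python
import heapq

READY_TIMEOUT = 30.0  # seconds, the READY timeout from the incident description
RETRY_DELAY = 5.0     # seconds, matching the logged "delay was 00:00:05"


class VirtualScheduler:
    """Runs scheduled callbacks when the test advances a fake clock."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, seconds):
        deadline = self.now + seconds
        while self._queue and self._queue[0][0] <= deadline:
            due, _, callback = heapq.heappop(self._queue)
            self.now = due
            callback()
        self.now = deadline


class Lifecycle:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.state = "Disconnected"
        self.events = []
        self._attempt = 0

    def connect(self):
        self._attempt += 1
        attempt = self._attempt
        self.state = "Connecting"
        self.scheduler.schedule(READY_TIMEOUT, lambda: self._ready_timed_out(attempt))

    def _ready_timed_out(self, attempt):
        if self.state != "Connecting" or attempt != self._attempt:
            return  # stale timer from an earlier attempt
        self.events.append("CleanReconnectRequested")
        self.state = "RetryScheduled"
        self.scheduler.schedule(RETRY_DELAY, self.connect)

    def ready(self):
        if self.state == "Connecting":
            self.state = "Ready"
            self.events.append("ConnectionRestored")


scheduler = VirtualScheduler()
actor = Lifecycle(scheduler)
actor.connect()
scheduler.advance(READY_TIMEOUT)  # first READY timeout
scheduler.advance(RETRY_DELAY)    # auto-retry fires
scheduler.advance(READY_TIMEOUT)  # retried connect also times out
scheduler.advance(RETRY_DELAY)    # second auto-retry fires
actor.ready()                     # READY finally arrives
scheduler.advance(3600.0)         # leftover timers must not disturb Ready
print(actor.state, actor.events)
```

Driving the clock by hand lets a test cover over an hour of timeouts and retries instantly and deterministically, which is what makes the "second clean-reconnect request" assertion practical.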

Validation

  • Full Netclaw.Actors.Tests suite: 2,282 passed
  • dotnet slopwatch analyze: clean; headers verified

Recommend cutting 0.24.0-beta.3 once merged — beta.2's Discord channel cannot recover from this state without a daemon restart.

@Aaronontheweb Aaronontheweb added the bug and channels labels Jun 9, 2026

@Aaronontheweb Aaronontheweb left a comment

Collaborator Author

LGTM

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) June 9, 2026 22:12
Root cause of the 0.24.0-beta.2 reliability incident (testlab Seq event
event-8f659382c67008de26f7351717000000): the auto-retry path connects
with ActorRefs.Nobody as the reply-to, and StartConnecting stored it
verbatim — so every `_pendingConnectReplyTo is null` check misfired.

When a retried connect timed out waiting for READY, HandleReadyTimedOut
took the caller-driven branch (Nobody passed the not-null check), failed
the "pending connect" by telling Nobody (a silent no-op), and parked the
actor in CleanReconnectRequired — a state that swallows Connected/READY
events, handles no RetryConnect, and arms no timers. Nothing could ever
leave it: when Discord.Net later self-reconnected, the actor dropped
every inbound message ("Dropping Discord message ... while gateway is
not ready") until the daemon was restarted. The same skew made
`isRetry = _pendingConnectReplyTo is null` compute false on retries, so
ConnectionRestored never published after an auto-retry recovery — in
both the Discord and Mattermost lifecycle actors.

Fix: normalize Nobody to null at the single storage point in
StartConnecting, in both actors. "No pending caller" now has exactly one
representation, so the ready-timeout path falls through to
RequestCleanReconnect (publish → clean stop → scheduled retry — the
self-healing loop), and ConnectionRestored fires on retry recoveries.

Regression tests, both mutation-verified (fail on the unfixed actors):
- Discord: virtual-time test driving the exact incident sequence —
  ready → spurious-Connected clean reconnect → auto-retry → READY
  timeout → must emit a second clean-reconnect request and recover when
  READY finally arrives, publishing ConnectionRestored
- Mattermost: auto-retry recovery must publish ConnectionRestored

Full Actors suite 2,282 green; slopwatch clean.
@Aaronontheweb Aaronontheweb force-pushed the fix/discord-lifecycle-retry-zombie branch from 5817505 to 7faee97 Compare June 9, 2026 22:19
@Aaronontheweb Aaronontheweb merged commit 2ef3d9d into netclaw-dev:dev Jun 9, 2026
15 checks passed
Aaronontheweb added a commit that referenced this pull request Jun 10, 2026
…cycle reliability (SPEC-015) (#1375)

Implements SPEC-015 Phases 2-7 (contract test bases, IGatewaySnapshot, generic lifecycle/gateway/conversation actor bases, RemoteChatChannelBuilder, IChannelOutboundClient send consolidation), adds blank-query destination discovery to lookup_channel_destination, and fixes every confirmed finding from a max-effort review — including the two remaining gateway-zombie paths beyond the #1374 hotfix (CleanReconnectRequired deleted; one canonical fail path), the Healthy-but-deaf connect-timeout window, clean-reconnect backoff, multi-channel thread-history fetcher cross-wiring, Mattermost proactive DM ACL, and registry-validated outbound clients. OpenSpec deltas and the adding-a-channel runbook updated to the as-built design.

Full solution suite green; every behavioral fix has a regression test that fails when reverted.