ClusterClient.SSubscribe silently re-subscribes to a random node after reconnect — PubSub.conn() ignores c.schannels #3806

@igorkofman

Expected Behavior

A *PubSub created via ClusterClient.SSubscribe should reconnect to the
node owning the shard channel's hash slot after a connection failure, the
same way it does for the initial connection.

Current Behavior

After any connection loss (server restart, transient network blip,
CLIENT KILL, idle timeout), the PubSub auto-reconnects to a random
cluster node and re-issues SSUBSCRIBE there. On any node other than the
slot owner, Redis replies -MOVED. The reply is never read (the resubscribe
path is write-only), so the failure is silent: the PubSub looks healthy,
Ping succeeds, Receive/Channel() keep returning, but no messages arrive
because the subscriber isn't on the shard the publishers reach.

On an N-node cluster the chance of landing on the wrong node is (N-1)/N per
reconnect. The only recovery is to close and recreate the PubSub (or restart
the process).

Root Cause

PubSub.conn() (pubsub.go:90) builds the channels slice that's handed to
the newConn callback for node selection, but it only collects c.channels
(regular SUBSCRIBE), not c.schannels (sharded SSUBSCRIBE):

channels := slices.Collect(maps.Keys(c.channels))
channels = append(channels, newChannels...)

cn, err := c.newConn(ctx, c.opt.Addr, channels)

For a PubSub that only has sharded subscriptions (the normal case for
ClusterClient.SSubscribe), channels is empty when reconnecting. The
ClusterClient's newConn closure (osscluster.go:2152-2178) then takes the
else branch and picks a random node:

if len(channels) > 0 {
    slot := hashtag.Slot(channels[0])
    ...
    node, err = c.slotMasterNode(ctx, slot)
    ...
} else {
    node, err = c.nodes.Random()       // ← reconnect lands here
    ...
}
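
To make the first branch concrete: the slot lookup maps a channel name to one of 16384 hash slots, and only the master owning that slot can serve the shard channel. The sketch below follows the key-hashing rule from the Redis Cluster specification (CRC16/XMODEM modulo 16384, with hash-tag handling). It is an illustration, not the code of go-redis's hashtag.Slot; crc16 and keySlot are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 computes CRC16/XMODEM (polynomial 0x1021, initial value 0,
// no reflection), the variant the Redis Cluster spec uses for key hashing.
func crc16(data string) uint16 {
	var crc uint16
	for i := 0; i < len(data); i++ {
		crc ^= uint16(data[i]) << 8
		for bit := 0; bit < 8; bit++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// keySlot returns the cluster hash slot for a key or shard channel name.
// If the name contains a non-empty {hash tag}, only the tag is hashed.
// (Hypothetical helper; edge cases such as empty keys are not handled.)
func keySlot(key string) int {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return int(crc16(key)) % 16384
}

func main() {
	for _, ch := range []string{"shard-chan", "{orders}.created", "{orders}.cancelled"} {
		fmt.Printf("%-20s -> slot %d\n", ch, keySlot(ch))
	}
}
```

With an empty channels list there is no name to hash, which is why the closure has nothing better than a random node to fall back on.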

resubscribe() (pubsub.go:127-128) then writes SSUBSCRIBE over the new
conn, but _subscribe() is write-only and never reads the reply. The
-MOVED from the wrong node is either dropped or eventually surfaces as an
error from ReceiveTimeout, but isBadConn() (error.go:204-210) returns
false for a MOVED that points at a different address, so it doesn't
trigger another reconnect.
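
As an illustration of the distinction described above, the following sketch parses a MOVED error message and checks whether it points away from the node the connection is on. It assumes the error text has the form "MOVED <slot> <host:port>" (the Redis wire format with the leading "-" stripped). movedTarget and pointsElsewhere are hypothetical helpers, not go-redis functions.

```go
package main

import (
	"fmt"
	"strings"
)

// movedTarget extracts the redirect address from a message of the form
// "MOVED <slot> <host:port>". ok is false for any other message.
func movedTarget(msg string) (addr string, ok bool) {
	fields := strings.Fields(msg)
	if len(fields) != 3 || fields[0] != "MOVED" {
		return "", false
	}
	return fields[2], true
}

// pointsElsewhere reports whether msg is a MOVED redirect to a node other
// than connAddr. For a sharded subscription this means the subscriber is
// connected to a node that does not own the channel's slot.
func pointsElsewhere(msg, connAddr string) bool {
	target, ok := movedTarget(msg)
	return ok && target != connAddr
}

func main() {
	const connAddr = "127.0.0.1:7001" // assumed address of the current conn
	for _, msg := range []string{
		"MOVED 3999 127.0.0.1:7002",
		"MOVED 3999 127.0.0.1:7001",
		"ERR unknown command",
	} {
		fmt.Printf("%q elsewhere=%v\n", msg, pointsElsewhere(msg, connAddr))
	}
}
```

A caller that does see such an error from ReceiveTimeout could use a check like this as a signal to close and recreate the PubSub.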

The initial connection isn't affected, because PubSub.subscribe() passes
the new channels through the newChannels argument to conn(), and they're
appended to the routing list. The bug only bites the reconnect path,
where newChannels is nil and c.schannels is the only place the shard
channels live.

Steps to Reproduce

  1. Start a Redis cluster (≥ 2 master shards).
  2. pubsub := clusterClient.SSubscribe(ctx, "shard-chan").
  3. Verify delivery works: clusterClient.SPublish(ctx, "shard-chan", "x") →
    pubsub.ReceiveTimeout(...) returns the message.
  4. Forcibly close the PubSub connection on the slot owner — e.g.
    CLIENT KILL TYPE pubsub on that node — without sending SUNSUBSCRIBE.
  5. Trigger reconnect (pubsub.Ping(ctx), or just wait for the next
    Receive / health-check ping).
  6. SPublish again → ReceiveTimeout times out, and SPUBLISH's return
    value is 0 (no subscribers reached on the slot owner).

A regression test in the repo's existing ginkgo cluster harness will be
included in the PR (osscluster_test.go); it iterates the
kill→reconnect→publish cycle 8 times so a lucky random-node hit can't mask
the bug.
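
A rough way to see why repeating the cycle helps: if each reconnect picks uniformly among N masters and the picks are independent (both are simplifying assumptions), the chance that every one of k cycles lands on the slot owner by luck is (1/N)^k. The sketch below computes that value; luckyPassProbability is a hypothetical name and the cluster sizes are examples only.

```go
package main

import (
	"fmt"
	"math"
)

// luckyPassProbability returns the chance that `cycles` independent,
// uniformly random reconnects all land on the one correct master out of
// `masters`, which would let a buggy build pass the test by accident.
func luckyPassProbability(masters, cycles int) float64 {
	return math.Pow(1/float64(masters), float64(cycles))
}

func main() {
	for _, n := range []int{2, 3, 25} {
		fmt.Printf("masters=%d cycles=8 lucky-pass=%.3g\n", n, luckyPassProbability(n, 8))
	}
}
```

Even on a two-master cluster, eight cycles leave a 1 in 256 chance of a false pass under these assumptions, and the chance shrinks quickly with more masters.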

Possible Solution

Include c.schannels when building the channel list passed to newConn:

channels := slices.Collect(maps.Keys(c.channels))
channels = append(channels, slices.Collect(maps.Keys(c.schannels))...)
channels = append(channels, newChannels...)

Sharded channels are appended after regular channels so that for
SSubscribe-only PubSubs (the common case) channels[0] is a shard
channel and slot routing works, while PubSubs using only regular
SUBSCRIBE/PSUBSCRIBE see no behavior change.
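
The effect of the ordering can be modeled without a cluster. The sketch below is a simplified stand-in for the list-building step, not the go-redis code: routingChannels is a hypothetical function, the subscription sets are modeled as plain maps, and keys are sorted only to make the output deterministic (the real code iterates maps in unspecified order).

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in sorted order.
func sortedKeys(m map[string]struct{}) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// routingChannels models the list handed to newConn. With includeSharded
// false it mirrors the current behavior; with true it mirrors the
// proposed fix: regular channels, then sharded channels, then new ones.
func routingChannels(channels, schannels map[string]struct{}, newChannels []string, includeSharded bool) []string {
	out := sortedKeys(channels)
	if includeSharded {
		out = append(out, sortedKeys(schannels)...)
	}
	return append(out, newChannels...)
}

func main() {
	sharded := map[string]struct{}{"shard-chan": {}}

	// Reconnect of an SSubscribe-only PubSub: newChannels is nil.
	fmt.Println("current: ", routingChannels(nil, sharded, nil, false))
	fmt.Println("proposed:", routingChannels(nil, sharded, nil, true))
}
```

In the current model the list is empty on reconnect, so routing falls through to the random-node branch. In the proposed model the first element is the shard channel, so slot routing applies.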

The channels argument is unused for routing in redis.Client.pubSub()
(single node, fixed addr), Ring.SSubscribe() (shard chosen before the
PubSub is created, owned by a single redis.Client), and SentinelClient
(no SSubscribe), so the change only affects ClusterClient.

Context (Environment)

  • go-redis: v9.17.1 (also reproduced against v9.19.0 and master @
    8cdff5946a35)
  • Server: Redis Cluster 7.x (sharded pub/sub requires ≥ 7.0)
  • Production impact: on a 25-shard cluster, a transient connection blip
    caused message receive rate to drop ~90% (24/25 chance of wrong node) and
    never recover. The PubSub health-check ping kept succeeding against the
    wrong node, so the failure went undetected for ~6 hours until a process
    restart.

Note on mixed regular + sharded subscriptions

A PubSub carrying both regular and sharded subscriptions on a cluster
client is already underdetermined (regular channels can be served by any
node; sharded channels must be on the slot owner; a single conn can't
satisfy both for arbitrary channel sets). This fix doesn't change behavior
for that case — channels[0] is still a regular channel and routing follows
it. ClusterClient.Subscribe/PSubscribe/SSubscribe each return a fresh
PubSub, so the mixed case only arises if the caller mixes
Subscribe/SSubscribe calls on the same PubSub deliberately.
