Expected Behavior
A *PubSub created via ClusterClient.SSubscribe should reconnect to the
node owning the shard channel's hash slot after a connection failure, the
same way it does for the initial connection.
Current Behavior
After any connection loss (server restart, transient network blip,
CLIENT KILL, idle timeout), the PubSub auto-reconnects to a random
cluster node and re-issues SSUBSCRIBE there. On any node other than the
slot owner, Redis replies -MOVED. The reply is never read (the resubscribe
path is write-only), so the failure is silent: the PubSub looks healthy,
Ping succeeds, Receive/Channel() keep returning, but no messages arrive
because the subscriber isn't on the shard the publishers reach.
On an N-node cluster the chance of landing on the wrong node is (N-1)/N per
reconnect. The only recovery is to close and recreate the PubSub (or restart
the process).
Root Cause
pubsub.go:90 in PubSub.conn() builds the channels slice that's handed to
the newConn callback for node selection, but it only collects c.channels
(regular SUBSCRIBE), not c.schannels (sharded SSUBSCRIBE):
channels := slices.Collect(maps.Keys(c.channels))
channels = append(channels, newChannels...)
cn, err := c.newConn(ctx, c.opt.Addr, channels)
For a PubSub that only has sharded subscriptions — the normal case for
ClusterClient.SSubscribe — channels is empty when reconnecting. The
ClusterClient's newConn closure (osscluster.go:2152-2178) then takes the
else branch and picks a random node:
if len(channels) > 0 {
	slot := hashtag.Slot(channels[0])
	...
	node, err = c.slotMasterNode(ctx, slot)
	...
} else {
	node, err = c.nodes.Random() // ← reconnect lands here
	...
}
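To make the routing step concrete, here is a minimal sketch of how a channel name maps to a hash slot under the Redis Cluster key distribution rules (CRC16/XMODEM modulo 16384, with hash tag support). It is an illustration only, not go-redis's hashtag.Slot implementation; crc16 and hashSlot are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

const slotCount = 16384

// crc16 is the CRC16-CCITT (XMODEM) variant named in the Redis Cluster
// specification: polynomial 0x1021, initial value 0, no reflection.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot hashes only the hash tag (the text between the first '{' and the
// next '}') when that tag is non-empty, otherwise the whole name.
func hashSlot(name string) int {
	if start := strings.IndexByte(name, '{'); start >= 0 {
		if end := strings.IndexByte(name[start+1:], '}'); end > 0 {
			name = name[start+1 : start+1+end]
		}
	}
	return int(crc16([]byte(name))) % slotCount
}

func main() {
	fmt.Println(hashSlot("123456789")) // 12739 (CRC16 reference value 0x31C3)
	// Names sharing a hash tag land in the same slot.
	fmt.Println(hashSlot("{shard}a") == hashSlot("{shard}b")) // true
}
```

Because only channels[0] is hashed, an empty list leaves the closure nothing to route on, which is why it falls through to the random-node branch.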
resubscribe() (pubsub.go:127-128) then writes SSUBSCRIBE over the new
conn, but _subscribe() is write-only and never reads the reply. The -MOVED
from the wrong node is either dropped, or eventually surfaces as an error
from ReceiveTimeout, but isBadConn() (error.go:204-210) returns false for
a MOVED that points at a different address, so it doesn't trigger another
reconnect.
The initial connection isn't affected, because PubSub.subscribe() passes
the new channels through the newChannels argument to conn(), and they're
appended to the routing list. The bug only bites the reconnect path,
where newChannels is nil and c.schannels is the only place the shard
channels live.
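The difference between the two paths can be modeled without a cluster. The following is a minimal sketch, not go-redis code: pubSubState, routingChannels, and pickNode are hypothetical stand-ins for PubSub, the list built in conn(), and the cluster newConn closure.

```go
package main

import "fmt"

// pubSubState is a hypothetical stand-in for the subscription state kept by
// PubSub: regular and sharded channels live in separate maps.
type pubSubState struct {
	channels  map[string]struct{}
	schannels map[string]struct{}
}

// routingChannels mirrors the behavior described above: only the regular
// channels plus newChannels are offered for node selection.
func (s *pubSubState) routingChannels(newChannels []string) []string {
	out := make([]string, 0, len(s.channels)+len(newChannels))
	for ch := range s.channels {
		out = append(out, ch)
	}
	return append(out, newChannels...)
}

// pickNode models the branch in the cluster newConn closure.
func pickNode(channels []string) string {
	if len(channels) > 0 {
		return "slot owner of " + channels[0]
	}
	return "random node"
}

func main() {
	s := &pubSubState{
		channels:  map[string]struct{}{},
		schannels: map[string]struct{}{},
	}

	// Initial SSubscribe: the channel arrives through newChannels.
	fmt.Println("initial:  ", pickNode(s.routingChannels([]string{"shard-chan"})))
	s.schannels["shard-chan"] = struct{}{}

	// Reconnect: newChannels is nil and schannels is never consulted.
	fmt.Println("reconnect:", pickNode(s.routingChannels(nil)))
}
```

In this model the first call routes to the slot owner and the second falls back to a random node, matching the asymmetry between the initial connection and the reconnect path.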
Steps to Reproduce
- Start a Redis cluster (≥ 2 master shards).
- pubsub := clusterClient.SSubscribe(ctx, "shard-chan").
- Verify delivery works: clusterClient.SPublish(ctx, "shard-chan", "x") →
  pubsub.ReceiveTimeout(...) returns the message.
- Forcibly close the PubSub connection on the slot owner (e.g.
  CLIENT KILL TYPE pubsub on that node) without sending SUNSUBSCRIBE.
- Trigger reconnect (pubsub.Ping(ctx), or just wait for the next
  Receive / health-check ping).
- SPublish again → ReceiveTimeout times out, and SPUBLISH's return
  value is 0 (no subscribers reached on the slot owner).
A regression test in the repo's existing ginkgo cluster harness will be
included in the PR (osscluster_test.go); it iterates the
kill→reconnect→publish cycle 8 times so a lucky random-node hit can't mask
the bug.
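As a rough check on why repeating the cycle helps, the arithmetic can be sketched as follows. This assumes each reconnect picks uniformly and independently among N nodes, which is a simplification of this report's (N-1)/N estimate and not a measured property; wrongNodeProb and maskedProb are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// wrongNodeProb returns (N-1)/N, the chance that a single reconnect lands on
// a node other than the slot owner under the uniform-choice assumption.
func wrongNodeProb(nodes int) float64 {
	return float64(nodes-1) / float64(nodes)
}

// maskedProb returns (1/N)^cycles, the chance that every one of `cycles`
// reconnects happens to hit the slot owner, hiding the bug from the test.
func maskedProb(nodes, cycles int) float64 {
	return math.Pow(1/float64(nodes), float64(cycles))
}

func main() {
	for _, n := range []int{2, 3, 25} {
		fmt.Printf("nodes=%d wrong-per-reconnect=%.3f masked-after-8=%.2e\n",
			n, wrongNodeProb(n), maskedProb(n, 8))
	}
}
```

Even on the smallest cluster in this model (2 nodes), eight lucky hits in a row have probability 1/256, so a passing run is unlikely to be a false negative.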
Possible Solution
Include c.schannels when building the channel list passed to newConn:
channels := slices.Collect(maps.Keys(c.channels))
channels = append(channels, slices.Collect(maps.Keys(c.schannels))...)
channels = append(channels, newChannels...)
Sharded channels are appended after regular channels so that for
SSubscribe-only PubSubs (the common case) channels[0] is a shard
channel and slot routing works, while PubSubs using only regular
SUBSCRIBE/PSUBSCRIBE see no behavior change.
The channels argument is unused for routing in redis.Client.pubSub()
(single node, fixed addr), Ring.SSubscribe() (shard chosen before the
PubSub is created, owned by a single redis.Client), and SentinelClient
(no SSubscribe), so the change only affects ClusterClient.
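The effect of the proposed ordering on channels[0] can be sketched in isolation. This is a simplified model rather than the actual patch: routingChannels, sortedKeys, and set are hypothetical helpers, and keys are sorted only to make the demo deterministic (the proposed change uses slices.Collect(maps.Keys(...)), whose order is unspecified).

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the map keys in sorted order (demo determinism only).
func sortedKeys(m map[string]struct{}) []string {
	out := make([]string, 0, len(m))
	for k := range m {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// routingChannels sketches the proposed order: regular channels first, then
// sharded channels, then newChannels.
func routingChannels(channels, schannels map[string]struct{}, newChannels []string) []string {
	out := sortedKeys(channels)
	out = append(out, sortedKeys(schannels)...)
	return append(out, newChannels...)
}

// set builds a string set from its arguments.
func set(names ...string) map[string]struct{} {
	m := make(map[string]struct{}, len(names))
	for _, n := range names {
		m[n] = struct{}{}
	}
	return m
}

func main() {
	// Sharded only: channels[0] is now the shard channel on reconnect.
	fmt.Println(routingChannels(set(), set("shard-chan"), nil)) // [shard-chan]
	// Regular only: same list as before the change.
	fmt.Println(routingChannels(set("news"), set(), nil)) // [news]
	// Mixed: a regular channel still comes first.
	fmt.Println(routingChannels(set("news"), set("shard-chan"), nil)) // [news shard-chan]
}
```

Appending sharded channels after regular ones keeps channels[0] unchanged for any PubSub that already has a regular subscription, which is what limits the behavior change to the sharded-only case.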
Context (Environment)
- go-redis: v9.17.1 (also reproduced against v9.19.0 and
  master @ 8cdff5946a35)
- Server: Redis Cluster 7.x (sharded pub/sub requires ≥ 7.0)
- Production impact: on a 25-shard cluster, a transient connection blip
caused message receive rate to drop ~90% (24/25 chance of wrong node) and
never recover. The PubSub health-check ping kept succeeding against the
wrong node, so the failure went undetected for ~6 hours until a process
restart.
Note on mixed regular + sharded subscriptions
A PubSub carrying both regular and sharded subscriptions on a cluster
client is already underdetermined (regular channels can be served by any
node; sharded channels must be on the slot owner; a single conn can't
satisfy both for arbitrary channel sets). This fix doesn't change behavior
for that case — channels[0] is still a regular channel and routing follows
it. ClusterClient.Subscribe/PSubscribe/SSubscribe each return a fresh
PubSub, so the mixed case only arises if the caller mixes
Subscribe/SSubscribe calls on the same PubSub deliberately.