Skip to content

fix(pubsub): include shard channels in newConn routing list#3829

Merged
ndyakov merged 1 commit into
masterfrom
fix/3806-ssubscribe-reconnect-includes-schannels
May 27, 2026
Merged

fix(pubsub): include shard channels in newConn routing list#3829
ndyakov merged 1 commit into
masterfrom
fix/3806-ssubscribe-reconnect-includes-schannels

Conversation

@ndyakov

@ndyakov ndyakov commented May 27, 2026

Copy link
Copy Markdown
Member

In ClusterClient.SSubscribe, the per-PubSub newConn closure picks the target node by hash slot of channels[0]. PubSub.conn() was only passing c.channels (regular SUBSCRIBE) into that closure — sharded SSubscribe channels stored in c.schannels were silently dropped.

For an SSubscribe-only PubSub (the common ClusterClient.SSubscribe case) this meant the channel list was empty on reconnect, so the closure fell through to nodes.Random(). The reconnected SSUBSCRIBE went to the wrong shard, the MOVED reply was never read by the write-only resubscribe path, and the PubSub looked healthy (Ping succeeds, Receive returns) while delivering no messages. On an N-shard cluster the failure mode hit (N-1)/N reconnects.

Add c.schannels to the channel list built in PubSub.conn(). c.patterns is intentionally NOT added: patterns are not slot-addressable, and including them would pin PSubscribe-only PubSubs to a single pattern-hash node and lose the existing random-node behaviour.

Adds TestPubSubConn_PassesPersistedChannelsToNewConn covering five cases (subscribe-only reconnect, ssubscribe-only reconnect, initial ssubscribe, mixed, and ordering of newChannels) via a stub newConn that captures the channel list. The cluster routing layer itself is not under test — the regression is whether the channel list reaches the closure intact.

Fixes #3806


Note

Medium Risk
Changes cluster pub/sub reconnect routing for SSubscribe and mixed subscriptions; scope is small but wrong routing would silently drop messages.

Overview
PubSub.conn() now forwards persisted shard (SSubscribe) channel names to the newConn callback together with regular subscribe channels and any newChannels, so cluster reconnects can route by slot instead of picking a random node when only sharded subscriptions exist (#3806).

c.patterns is still omitted from that routing list so pattern-only pub/sub keeps its prior random-node behavior.

Adds TestPubSubConn_PassesPersistedChannelsToNewConn with a stub newConn that asserts the combined channel list for subscribe-only, ssubscribe-only, mixed, and ordering cases.

Reviewed by Cursor Bugbot for commit 9a53e6c. Bugbot is set up for automated code reviews on this repo. Configure here.

In ClusterClient.SSubscribe, the per-PubSub newConn closure picks the
target node by hash slot of channels[0]. PubSub.conn() was only passing
c.channels (regular SUBSCRIBE) into that closure — sharded SSubscribe
channels stored in c.schannels were silently dropped.

For an SSubscribe-only PubSub (the common ClusterClient.SSubscribe case)
this meant the channel list was empty on reconnect, so the closure fell
through to nodes.Random(). The reconnected SSUBSCRIBE went to the wrong
shard, the MOVED reply was never read by the write-only resubscribe
path, and the PubSub looked healthy (Ping succeeds, Receive returns)
while delivering no messages. On an N-shard cluster the failure mode hit
(N-1)/N reconnects.

Add c.schannels to the channel list built in PubSub.conn(). c.patterns
is intentionally NOT added: patterns are not slot-addressable, and
including them would pin PSubscribe-only PubSubs to a single
pattern-hash node and lose the existing random-node behaviour.

Adds TestPubSubConn_PassesPersistedChannelsToNewConn covering five cases
(subscribe-only reconnect, ssubscribe-only reconnect, initial
ssubscribe, mixed, and ordering of newChannels) via a stub newConn that
captures the channel list. The cluster routing layer itself is not
under test — the regression is whether the channel list reaches the
closure intact.

Fixes #3806
@jit-ci

jit-ci Bot commented May 27, 2026

Copy link
Copy Markdown

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a silent message-loss bug in ClusterClient.SSubscribe reconnects. PubSub.conn() previously passed only c.channels to the newConn closure, so SSubscribe-only PubSubs reconnected to a random node instead of the shard slot owner. The fix appends c.schannels (but intentionally not c.patterns) to the routing list.

Changes:

  • Include c.schannels in the channel list handed to newConn in PubSub.conn().
  • Add TestPubSubConn_PassesPersistedChannelsToNewConn covering subscribe-only, ssubscribe-only, initial, mixed, and ordering cases via a stub newConn.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
pubsub.go Append c.schannels keys to the channel list passed to newConn so cluster reconnects route to the slot owner.
internal_test.go Add regression test verifying persisted channels and schannels are forwarded to the newConn closure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ndyakov ndyakov added the bug label May 27, 2026
@ndyakov ndyakov merged commit d8407df into master May 27, 2026
40 checks passed
@ndyakov ndyakov deleted the fix/3806-ssubscribe-reconnect-includes-schannels branch May 27, 2026 13:27
saschazepter pushed a commit to saschazepter/forgejo that referenced this pull request May 29, 2026
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) | `v9.19.0` → `v9.20.0` | ![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fredis%2fgo-redis%2fv9/v9.20.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fredis%2fgo-redis%2fv9/v9.19.0/v9.20.0?slim=true) |

---

### Release Notes

<details>
<summary>redis/go-redis (github.com/redis/go-redis/v9)</summary>

### [`v9.20.0`](https://github.com/redis/go-redis/releases/tag/v9.20.0): 9.20.0

[Compare Source](redis/go-redis@v9.19.0...v9.20.0)

#### 🚀 Highlights

##### Redis 8.8 Support

This release adds support for **Redis 8.8**. The README's supported-versions list now includes Redis 8.8 alongside 8.0/8.2/8.4, and CI exercises the `8.8` client-libs-test image across the full suite (Makefile, build workflow, doctests, run-tests action, and docker-compose).

Coverage for the new commands that ship in the 8.x line, rounded out in this release:

- **`AR*` array data type** ([#&#8203;3813](redis/go-redis#3813)) — new array data structure, exposed via the `ArrayCmdable` interface (see the experimental-features highlight below).
- **`INCREX`** ([#&#8203;3816](redis/go-redis#3816)) — atomic increment with expiration in a single round-trip.
- **`XNACK`** ([#&#8203;3790](redis/go-redis#3790)) — explicit negative-acknowledge of pending stream entries.
- **`XAUTOCLAIM` PEL deletes** ([#&#8203;3798](redis/go-redis#3798)) — `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the pending entries list.
- **`TS.RANGE` multiple aggregators** ([#&#8203;3791](redis/go-redis#3791)) — `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` accept multiple aggregators in a single call.
- **`Z(UNION|INTER|DIFF)` `COUNT` aggregator** ([#&#8203;3802](redis/go-redis#3802)) — `COUNT` reducer for sorted-set set operations.
- **`JSON.SET FPHA`** ([#&#8203;3797](redis/go-redis#3797)) — new `FPHA` argument that specifies the floating-point type for homogeneous FP arrays.

CI image bump ([#&#8203;3814](redis/go-redis#3814)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa). Command coverage contributions by [@&#8203;cxljs](https://github.com/cxljs), [@&#8203;elena-kolevska](https://github.com/elena-kolevska), [@&#8203;Khukharr](https://github.com/Khukharr), [@&#8203;ndyakov](https://github.com/ndyakov), and [@&#8203;ofekshenawa](https://github.com/ofekshenawa).

##### Stable RESP3 for RediSearch (`UnstableResp3` deprecated)

`FT.SEARCH`, `FT.AGGREGATE`, `FT.INFO`, `FT.SPELLCHECK`, and `FT.SYNDUMP` now parse RESP3 (map) responses into the same typed result objects as RESP2 — `Val()` and `Result()` work uniformly on both protocols, no flag required. Previously, RESP3 search responses required `UnstableResp3: true` and were returned as opaque maps accessible only via `RawResult()` / `RawVal()`.

As a result, the `UnstableResp3` option is now a **no-op** across every options struct (`Options`, `ClusterOptions`, `UniversalOptions`, `FailoverOptions`, `RingOptions`) and has been marked `// Deprecated:`. The field is retained for backwards compatibility — existing code that sets `UnstableResp3: true` will continue to compile and behave identically — but it will be removed in a future release and new code should not set it. `RawResult()` / `RawVal()` continue to work for callers that prefer the raw RESP payload.

([#&#8203;3741](redis/go-redis#3741)) by [@&#8203;ndyakov](https://github.com/ndyakov)

##### Experimental Array Data Structure Commands

Adds an experimental `ArrayCmdable` interface with the `AR*` command family (`ARSet`, `ARGet`, `ARGetRange`, `ARMSet`, `ARMGet`, `ARDel`, `ARDelRange`, `ARScan`, `ARSeek`, `ARNext`, `ARLastItems`, `ARGrep`, `ARGrepWithValues`, `ARInfo`/`ARInfoFull`, and typed reducers `AROpSum`/`AROpMin`/`AROpMax`/`AROpAnd`/`AROpOr`/`AROpXor`/`AROpMatch`/`AROpUsed`) for working with Redis 8.8's new array data type. **API is experimental and may change in a future release.**

([#&#8203;3813](redis/go-redis#3813)) by [@&#8203;cxljs](https://github.com/cxljs)

#### ✨ New Features

- **RESP3 search parser**: First-class RESP3 parsing for `FT.SEARCH`/`FT.AGGREGATE`/`FT.INFO`/`FT.SPELLCHECK`/`FT.SYNDUMP` responses with backwards compatibility for RESP2 ([#&#8203;3741](redis/go-redis#3741)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **INCREX**: New `INCREX` command support — atomic increment with expiration ([#&#8203;3816](redis/go-redis#3816)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **XNACK**: Client support for the `XNACK` stream command for explicitly negative-acknowledging pending entries ([#&#8203;3790](redis/go-redis#3790)) by [@&#8203;elena-kolevska](https://github.com/elena-kolevska)
- **TS range multiple aggregators**: `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` now accept multiple aggregators in a single call ([#&#8203;3791](redis/go-redis#3791)) by [@&#8203;elena-kolevska](https://github.com/elena-kolevska)
- **`XAutoClaim` deleted IDs**: `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the PEL ([#&#8203;3798](redis/go-redis#3798)) by [@&#8203;Khukharr](https://github.com/Khukharr)
- **`JSON.SET FPHA`**: `JSON.SET` accepts a new `FPHA` argument that specifies the floating-point type for homogeneous floating-point arrays ([#&#8203;3797](redis/go-redis#3797)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sorted-set union/intersection COUNT**: `ZUNION`/`ZINTER`/`ZDIFF` aggregator now supports `COUNT` ([#&#8203;3802](redis/go-redis#3802)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa)
- **`FT.HYBRID` vector validation**: Validates hybrid-search vector input types and adds proper typed vector parameters ([#&#8203;3756](redis/go-redis#3756)) by [@&#8203;DengY11](https://github.com/DengY11)
- **Cluster pool wait stats**: `ClusterClient.PoolStats()` now accumulates `WaitCount` and `WaitDurationNs` across all node pools (previously always zero) ([#&#8203;3809](redis/go-redis#3809)) by [@&#8203;LINKIWI](https://github.com/LINKIWI)

#### 🐛 Bug Fixes

- **TLS-only Cluster PubSub**: `CLUSTER SLOTS` port-0 entries now fall back to the origin endpoint's port, fixing `dial tcp <ip>:0: connection refused` on TLS-only clusters started with `--port 0 --tls-port <port>` (fixes [#&#8203;3726](redis/go-redis#3726)) ([#&#8203;3828](redis/go-redis#3828)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sharded PubSub reconnect routing**: `PubSub.conn()` now passes both regular (`c.channels`) and sharded (`c.schannels`) channels into the per-PubSub `newConn` closure. Previously, `ClusterClient.SSubscribe`-only PubSubs reconnected to a random node (because the routing closure saw an empty channel list), the `SSUBSCRIBE` was sent to the wrong shard, and the resulting `MOVED` reply was silently dropped ([#&#8203;3829](redis/go-redis#3829)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **ClusterClient `Watch` retry**: User errors returned from a `Watch` callback are no longer subjected to cluster-retry classification; transient cluster errors still retry, but a callback returning e.g. `net.ErrClosed` short-circuits immediately ([#&#8203;3821](redis/go-redis#3821)) by [@&#8203;obiyang](https://github.com/obiyang)
- **Sentinel concurrent-probe leak**: `MasterAddr`'s concurrent sentinel probe now closes the non-winning sentinel clients instead of leaking them ([#&#8203;3827](redis/go-redis#3827)) by [@&#8203;cxljs](https://github.com/cxljs)
- **Sentinel rediscovery loop on master-only setups**: `replicaAddrs` no longer tears down the cached sentinel client when the replica list is empty, eliminating a continuous rediscovery loop on master-only Sentinel deployments that flooded logs and added per-operation latency ([#&#8203;3795](redis/go-redis#3795)) by [@&#8203;shahyash2609](https://github.com/shahyash2609)
- **Pool `CloseConn` hooks**: `Pool.CloseConn` now triggers registered hooks, fixing a memory leak when connections are closed explicitly rather than via the normal removal path ([#&#8203;3818](redis/go-redis#3818)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Dial TCP error redirection**: Wrapped `dial tcp` errors are now correctly classified as redirectable so cluster routing can recover from a single unreachable node ([#&#8203;3810](redis/go-redis#3810)) by [@&#8203;vladisa88](https://github.com/vladisa88)
- **Pool `Close` health checks**: `ConnPool.Close` now only runs health checks against idle connections, avoiding spurious activity on connections still in use ([#&#8203;3805](redis/go-redis#3805)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **VLinks return type**: Fixed the return type of `VLINKS`/`VLINKSWITHSCORES` vector-set replies ([#&#8203;3820](redis/go-redis#3820)) by [@&#8203;romanpovol](https://github.com/romanpovol)

#### 🧪 Testing & Infrastructure

- **Flaky tests**: Stabilized several flaky tests in the sentinel and pool suites ([#&#8203;3815](redis/go-redis#3815)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sentinel failover metric race**: Fixed a data race in the sentinel failover metric test ([#&#8203;3824](redis/go-redis#3824)) by [@&#8203;cxljs](https://github.com/cxljs)
- **`waitForSentinelClusterStable` post-conditions**: The sentinel test harness now waits for replicas to be fully connected (not just present in the count) and is robust to randomized spec ordering after failover specs, eliminating an intermittent `Expected master to equal slave` flake ([#&#8203;3830](redis/go-redis#3830)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **`govulncheck` workflow**: New scheduled GitHub Actions workflow runs `govulncheck` on every push, PR, and weekly, surfacing newly disclosed Go vulnerabilities even when no code changes ([#&#8203;3779](redis/go-redis#3779)) by [@&#8203;solardome](https://github.com/solardome)
- **CI Redis 8.8-rc1**: CI now exercises the 8.8-rc1 Redis image ([#&#8203;3814](redis/go-redis#3814)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa)

#### 🧰 Maintenance

- **`Cmd.Slot()` lookup refactor**: Caches the per-command `CommandInfo` and short-circuits keyless commands before the switch dispatch, removing redundant `Peek` calls ([#&#8203;3804](redis/go-redis#3804)) by [@&#8203;retr0-kernel](https://github.com/retr0-kernel)
- **stdlib `math/rand`**: Replaced `internal/rand` with `math/rand` from the standard library now that the minimum Go version is 1.24 ([#&#8203;3823](redis/go-redis#3823)) by [@&#8203;cxljs](https://github.com/cxljs)
- **ConnPool queue channel**: Removed the unused queue channel from `ConnPool`, trimming the pool's footprint ([#&#8203;3826](redis/go-redis#3826)) by [@&#8203;cxljs](https://github.com/cxljs)
- **Extra packages LICENSE**: Added a LICENSE file to each `extra/*` package ([#&#8203;3817](redis/go-redis#3817)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **README & CI image**: Documentation refresh and bumped the default CI image tag ([#&#8203;3822](redis/go-redis#3822)) by [@&#8203;ndyakov](https://github.com/ndyakov)

#### 👥 Contributors

We'd like to thank all the contributors who worked on this release!

[@&#8203;cxljs](https://github.com/cxljs), [@&#8203;DengY11](https://github.com/DengY11), [@&#8203;elena-kolevska](https://github.com/elena-kolevska), [@&#8203;Khukharr](https://github.com/Khukharr), [@&#8203;LINKIWI](https://github.com/LINKIWI), [@&#8203;ndyakov](https://github.com/ndyakov), [@&#8203;obiyang](https://github.com/obiyang), [@&#8203;ofekshenawa](https://github.com/ofekshenawa), [@&#8203;retr0-kernel](https://github.com/retr0-kernel), [@&#8203;romanpovol](https://github.com/romanpovol), [@&#8203;shahyash2609](https://github.com/shahyash2609), [@&#8203;solardome](https://github.com/solardome), [@&#8203;vladisa88](https://github.com/vladisa88)

***

**Full Changelog**: <redis/go-redis@v9.19.0...v9.20.0>

</details>

---

### Configuration

📅 **Schedule**: (UTC)

- Branch creation
  - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`)
- Automerge
  - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`)

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xOTUuMSIsInVwZGF0ZWRJblZlciI6IjQzLjE5NS4xIiwidGFyZ2V0QnJhbmNoIjoiZm9yZ2VqbyIsImxhYmVscyI6WyJkZXBlbmRlbmN5LXVwZ3JhZGUiLCJ0ZXN0L25vdC1uZWVkZWQiXX0=-->

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/12804
Reviewed-by: Mathieu Fenniak <mfenniak@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ClusterClient.SSubscribe silently re-subscribes to a random node after reconnect — PubSub.conn() ignores c.schannels

3 participants