Skip to content

fix: do not close sentinel when replica list is empty#3795

Merged
ndyakov merged 8 commits into
redis:masterfrom
shahyash2609:fix/sentinel-no-replica-rediscovery-loop
May 27, 2026
Merged

fix: do not close sentinel when replica list is empty#3795
ndyakov merged 8 commits into
redis:masterfrom
shahyash2609:fix/sentinel-no-replica-rediscovery-loop

Conversation

@shahyash2609

@shahyash2609 shahyash2609 commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Bug: replicaAddrss() in sentinel.go calls closeSentinel() when getReplicaAddrs returns an empty list without error. This tears down the sentinel connection (and its +switch-master pub/sub subscription), forcing a full sentinel rediscovery on every subsequent operation.
  • Impact: On master-only sentinel setups (no replicas), this creates a continuous rediscovery loop — every dial triggers concurrent queries to all sentinel nodes, flooding logs and adding unnecessary latency.
  • Fix: Return []string{}, nil instead of calling closeSentinel() when replicas are empty but there's no error. An empty replica list is a valid, stable state for master-only deployments. The sentinel connection must be preserved for ongoing master discovery and failover monitoring.

Root cause

In sentinel.go, function replicaAddrss():

} else {
    // No error and no replicas.
    _ = c.closeSentinel()  // ← BUG: destroys sentinel for a valid state
}

After closeSentinel(), c.sentinel is nil and c.pubsub is nil. The next call to MasterAddr() must perform full sentinel discovery (querying all sentinel nodes concurrently) because there's no cached sentinel client to use.

With replicas present this code path is never reached — replicaAddrs() returns the cached list and does an async refresh. The bug only manifests on master-only clusters.

Changes

File Change
sentinel.go Replace closeSentinel() with return []string{}, nil in the zero-replica branch
export_test.go Add test helpers to inspect sentinelFailover internal state
sentinel_test.go Add TestSentinelFailover_ReplicaAddrs_NoReplicas verifying sentinel survives repeated calls with zero replicas

Test plan

  • TestSentinelFailover_ReplicaAddrs_NoReplicas — confirms sentinel stays alive after ReplicaAddrs returns empty list
  • Existing sentinel tests continue to pass (no behavioral change for setups with replicas)
  • Verified in production at Meesho on master-only sentinel clusters — rediscovery loop eliminated

🤖 Generated with Claude Code


Note

Medium Risk
Changes failover client connection lifecycle for master-only sentinel configs; low blast radius for clusters that already have replicas, but affects every replica lookup on zero-replica topologies.

Overview
Fixes sentinel failover so an empty replica list with no error no longer tears down the cached sentinel client. In replicaAddrs, that path now returns []string{}, nil when useDisconnected is false, keeping master discovery and +switch-master pub/sub alive on master-only deployments. When useDisconnected is true, behavior is unchanged: the sentinel is still closed so the discovery loop can hunt disconnected replicas.

Test-only exports in export_test.go (HasSentinel, NewTestSentinelFailover, ReplicaAddrs) support a new integration test that monitors a master with zero replicas and asserts the sentinel stays connected across repeated ReplicaAddrs calls.

Reviewed by Cursor Bugbot for commit 1441d62. Bugbot is set up for automated code reviews on this repo. Configure here.

When a sentinel-monitored Redis master has no replicas (master-only
setup), replicaAddrss() was calling closeSentinel() on receiving an
empty replica list without error. This destroyed the sentinel
connection (and its pub/sub failover subscription), forcing a full
sentinel rediscovery on every subsequent operation — creating a
continuous rediscovery loop that flooded logs and added latency.

The empty-replica case is a valid, stable state for master-only
deployments. The sentinel connection must be preserved for ongoing
master discovery and failover monitoring via pub/sub.

The fix returns early with an empty slice instead of tearing down
the sentinel. A test is added to verify the sentinel survives
repeated ReplicaAddrs calls when no replicas exist.

Fixes: sentinel rediscovery loop on master-only clusters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jit-ci

jit-ci Bot commented Apr 27, 2026

Copy link
Copy Markdown

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4620700cac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread sentinel.go
Comment thread export_test.go Outdated
Comment thread sentinel.go
Address review feedback: the early return for zero replicas must only
apply when useDisconnected=false (the normal path). When
useDisconnected=true, fall through to the discovery loop which
correctly passes useDisconnected to parseReplicaAddrs — since
getReplicaAddrs hardcodes false, the cached sentinel path can never
find disconnected replicas.

Also remove unused SetSentinelForTest helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread sentinel_test.go
shahyash2609 and others added 2 commits April 27, 2026 22:20
Address review feedback: the previous test used sentinelName which
has two replicas in the test environment, so getReplicaAddrs returned
a non-empty list and the zero-replica code path was never exercised.

Now the test registers a temporary master-only name via SENTINEL
MONITOR (pointing to the existing master port but with a new name
and no replicas), verifies that ReplicaAddrs returns empty without
closing the sentinel, and cleans up via SENTINEL REMOVE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a connectivity check at the start of the test so it is skipped
(not failed) in environments where sentinel processes are not running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit a95c9f1. Configure here.

Comment thread sentinel_test.go Outdated
shahyash2609 and others added 3 commits April 27, 2026 23:43
Point SENTINEL MONITOR at the standalone Redis instance (redisPort)
instead of sentinelMasterPort, which already has two replicas. This
prevents Sentinel from discovering replicas via INFO REPLICATION and
ensures the test reliably exercises the zero-replica code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shahyash2609

Copy link
Copy Markdown
Contributor Author

@ofekshenawa @ndyakov Can you please review this PR?

@ndyakov

ndyakov commented Apr 29, 2026

Copy link
Copy Markdown
Member

@shahyash2609 thank you for pinging us, will check it as soon as possible.

@shahyash2609

Copy link
Copy Markdown
Contributor Author

Hi @ndyakov, Hope you are doing well.
Can you pleaes review this once you get time?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Sentinel failover behavior for master-only deployments by preserving the cached Sentinel connection when replica discovery returns an empty list without error.

Changes:

  • Updates sentinelFailover.replicaAddrs to return an empty replica list without closing Sentinel for the normal replica lookup path.
  • Adds test-only helpers for inspecting and exercising sentinelFailover.
  • Adds a regression test for zero-replica Sentinel-monitored masters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
sentinel.go Adjusts zero-replica handling in Sentinel replica discovery.
export_test.go Adds test helpers for sentinelFailover state and replica lookup.
sentinel_test.go Adds regression coverage for master-only Sentinel setups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sentinel.go
@ndyakov ndyakov added the bug label May 27, 2026

@ndyakov ndyakov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @shahyash2609, if you want you can have a followup to validate that UseDisconnectedReplicas is not set for master only OR that it works as expected without tearing down the sentinel.

@ndyakov ndyakov merged commit 875ce21 into redis:master May 27, 2026
40 checks passed
saschazepter pushed a commit to saschazepter/forgejo that referenced this pull request May 29, 2026
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) | `v9.19.0` → `v9.20.0` | ![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fredis%2fgo-redis%2fv9/v9.20.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fredis%2fgo-redis%2fv9/v9.19.0/v9.20.0?slim=true) |

---

### Release Notes

<details>
<summary>redis/go-redis (github.com/redis/go-redis/v9)</summary>

### [`v9.20.0`](https://github.com/redis/go-redis/releases/tag/v9.20.0): 9.20.0

[Compare Source](redis/go-redis@v9.19.0...v9.20.0)

#### 🚀 Highlights

##### Redis 8.8 Support

This release adds support for **Redis 8.8**. The README's supported-versions list now includes Redis 8.8 alongside 8.0/8.2/8.4, and CI exercises the `8.8` client-libs-test image across the full suite (Makefile, build workflow, doctests, run-tests action, and docker-compose).

Coverage for the new commands that ship in the 8.x line, rounded out in this release:

- **`AR*` array data type** ([#&#8203;3813](redis/go-redis#3813)) — new array data structure, exposed via the `ArrayCmdable` interface (see the experimental-features highlight below).
- **`INCREX`** ([#&#8203;3816](redis/go-redis#3816)) — atomic increment with expiration in a single round-trip.
- **`XNACK`** ([#&#8203;3790](redis/go-redis#3790)) — explicit negative-acknowledge of pending stream entries.
- **`XAUTOCLAIM` PEL deletes** ([#&#8203;3798](redis/go-redis#3798)) — `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the pending entries list.
- **`TS.RANGE` multiple aggregators** ([#&#8203;3791](redis/go-redis#3791)) — `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` accept multiple aggregators in a single call.
- **`Z(UNION|INTER|DIFF)` `COUNT` aggregator** ([#&#8203;3802](redis/go-redis#3802)) — `COUNT` reducer for sorted-set set operations.
- **`JSON.SET FPHA`** ([#&#8203;3797](redis/go-redis#3797)) — new `FPHA` argument that specifies the floating-point type for homogeneous FP arrays.

CI image bump ([#&#8203;3814](redis/go-redis#3814)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa). Command coverage contributions by [@&#8203;cxljs](https://github.com/cxljs), [@&#8203;elena-kolevska](https://github.com/elena-kolevska), [@&#8203;Khukharr](https://github.com/Khukharr), [@&#8203;ndyakov](https://github.com/ndyakov), and [@&#8203;ofekshenawa](https://github.com/ofekshenawa).

##### Stable RESP3 for RediSearch (`UnstableResp3` deprecated)

`FT.SEARCH`, `FT.AGGREGATE`, `FT.INFO`, `FT.SPELLCHECK`, and `FT.SYNDUMP` now parse RESP3 (map) responses into the same typed result objects as RESP2 — `Val()` and `Result()` work uniformly on both protocols, no flag required. Previously, RESP3 search responses required `UnstableResp3: true` and were returned as opaque maps accessible only via `RawResult()` / `RawVal()`.

As a result, the `UnstableResp3` option is now a **no-op** across every options struct (`Options`, `ClusterOptions`, `UniversalOptions`, `FailoverOptions`, `RingOptions`) and has been marked `// Deprecated:`. The field is retained for backwards compatibility — existing code that sets `UnstableResp3: true` will continue to compile and behave identically — but it will be removed in a future release and new code should not set it. `RawResult()` / `RawVal()` continue to work for callers that prefer the raw RESP payload.

([#&#8203;3741](redis/go-redis#3741)) by [@&#8203;ndyakov](https://github.com/ndyakov)

##### Experimental Array Data Structure Commands

Adds an experimental `ArrayCmdable` interface with the `AR*` command family (`ARSet`, `ARGet`, `ARGetRange`, `ARMSet`, `ARMGet`, `ARDel`, `ARDelRange`, `ARScan`, `ARSeek`, `ARNext`, `ARLastItems`, `ARGrep`, `ARGrepWithValues`, `ARInfo`/`ARInfoFull`, and typed reducers `AROpSum`/`AROpMin`/`AROpMax`/`AROpAnd`/`AROpOr`/`AROpXor`/`AROpMatch`/`AROpUsed`) for working with Redis 8.8's new array data type. **API is experimental and may change in a future release.**

([#&#8203;3813](redis/go-redis#3813)) by [@&#8203;cxljs](https://github.com/cxljs)

#### ✨ New Features

- **RESP3 search parser**: First-class RESP3 parsing for `FT.SEARCH`/`FT.AGGREGATE`/`FT.INFO`/`FT.SPELLCHECK`/`FT.SYNDUMP` responses with backwards compatibility for RESP2 ([#&#8203;3741](redis/go-redis#3741)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **INCREX**: New `INCREX` command support — atomic increment with expiration ([#&#8203;3816](redis/go-redis#3816)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **XNACK**: Client support for the `XNACK` stream command for explicitly negative-acknowledging pending entries ([#&#8203;3790](redis/go-redis#3790)) by [@&#8203;elena-kolevska](https://github.com/elena-kolevska)
- **TS range multiple aggregators**: `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` now accept multiple aggregators in a single call ([#&#8203;3791](redis/go-redis#3791)) by [@&#8203;elena-kolevska](https://github.com/elena-kolevska)
- **`XAutoClaim` deleted IDs**: `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the PEL ([#&#8203;3798](redis/go-redis#3798)) by [@&#8203;Khukharr](https://github.com/Khukharr)
- **`JSON.SET FPHA`**: `JSON.SET` accepts a new `FPHA` argument that specifies the floating-point type for homogeneous floating-point arrays ([#&#8203;3797](redis/go-redis#3797)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sorted-set union/intersection COUNT**: `ZUNION`/`ZINTER`/`ZDIFF` aggregator now supports `COUNT` ([#&#8203;3802](redis/go-redis#3802)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa)
- **`FT.HYBRID` vector validation**: Validates hybrid-search vector input types and adds proper typed vector parameters ([#&#8203;3756](redis/go-redis#3756)) by [@&#8203;DengY11](https://github.com/DengY11)
- **Cluster pool wait stats**: `ClusterClient.PoolStats()` now accumulates `WaitCount` and `WaitDurationNs` across all node pools (previously always zero) ([#&#8203;3809](redis/go-redis#3809)) by [@&#8203;LINKIWI](https://github.com/LINKIWI)

#### 🐛 Bug Fixes

- **TLS-only Cluster PubSub**: `CLUSTER SLOTS` port-0 entries now fall back to the origin endpoint's port, fixing `dial tcp <ip>:0: connection refused` on TLS-only clusters started with `--port 0 --tls-port <port>` (fixes [#&#8203;3726](redis/go-redis#3726)) ([#&#8203;3828](redis/go-redis#3828)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sharded PubSub reconnect routing**: `PubSub.conn()` now passes both regular (`c.channels`) and sharded (`c.schannels`) channels into the per-PubSub `newConn` closure. Previously, `ClusterClient.SSubscribe`-only PubSubs reconnected to a random node (because the routing closure saw an empty channel list), the `SSUBSCRIBE` was sent to the wrong shard, and the resulting `MOVED` reply was silently dropped ([#&#8203;3829](redis/go-redis#3829)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **ClusterClient `Watch` retry**: User errors returned from a `Watch` callback are no longer subjected to cluster-retry classification; transient cluster errors still retry, but a callback returning e.g. `net.ErrClosed` short-circuits immediately ([#&#8203;3821](redis/go-redis#3821)) by [@&#8203;obiyang](https://github.com/obiyang)
- **Sentinel concurrent-probe leak**: `MasterAddr`'s concurrent sentinel probe now closes the non-winning sentinel clients instead of leaking them ([#&#8203;3827](redis/go-redis#3827)) by [@&#8203;cxljs](https://github.com/cxljs)
- **Sentinel rediscovery loop on master-only setups**: `replicaAddrs` no longer tears down the cached sentinel client when the replica list is empty, eliminating a continuous rediscovery loop on master-only Sentinel deployments that flooded logs and added per-operation latency ([#&#8203;3795](redis/go-redis#3795)) by [@&#8203;shahyash2609](https://github.com/shahyash2609)
- **Pool `CloseConn` hooks**: `Pool.CloseConn` now triggers registered hooks, fixing a memory leak when connections are closed explicitly rather than via the normal removal path ([#&#8203;3818](redis/go-redis#3818)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Dial TCP error redirection**: Wrapped `dial tcp` errors are now correctly classified as redirectable so cluster routing can recover from a single unreachable node ([#&#8203;3810](redis/go-redis#3810)) by [@&#8203;vladisa88](https://github.com/vladisa88)
- **Pool `Close` health checks**: `ConnPool.Close` now only runs health checks against idle connections, avoiding spurious activity on connections still in use ([#&#8203;3805](redis/go-redis#3805)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **VLinks return type**: Fixed the return type of `VLINKS`/`VLINKSWITHSCORES` vector-set replies ([#&#8203;3820](redis/go-redis#3820)) by [@&#8203;romanpovol](https://github.com/romanpovol)

#### 🧪 Testing & Infrastructure

- **Flaky tests**: Stabilized several flaky tests in the sentinel and pool suites ([#&#8203;3815](redis/go-redis#3815)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **Sentinel failover metric race**: Fixed a data race in the sentinel failover metric test ([#&#8203;3824](redis/go-redis#3824)) by [@&#8203;cxljs](https://github.com/cxljs)
- **`waitForSentinelClusterStable` post-conditions**: The sentinel test harness now waits for replicas to be fully connected (not just present in the count) and is robust to randomized spec ordering after failover specs, eliminating an intermittent `Expected master to equal slave` flake ([#&#8203;3830](redis/go-redis#3830)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **`govulncheck` workflow**: New scheduled GitHub Actions workflow runs `govulncheck` on every push, PR, and weekly, surfacing newly disclosed Go vulnerabilities even when no code changes ([#&#8203;3779](redis/go-redis#3779)) by [@&#8203;solardome](https://github.com/solardome)
- **CI Redis 8.8-rc1**: CI now exercises the 8.8-rc1 Redis image ([#&#8203;3814](redis/go-redis#3814)) by [@&#8203;ofekshenawa](https://github.com/ofekshenawa)

#### 🧰 Maintenance

- **`Cmd.Slot()` lookup refactor**: Caches the per-command `CommandInfo` and short-circuits keyless commands before the switch dispatch, removing redundant `Peek` calls ([#&#8203;3804](redis/go-redis#3804)) by [@&#8203;retr0-kernel](https://github.com/retr0-kernel)
- **stdlib `math/rand`**: Replaced `internal/rand` with `math/rand` from the standard library now that the minimum Go version is 1.24 ([#&#8203;3823](redis/go-redis#3823)) by [@&#8203;cxljs](https://github.com/cxljs)
- **ConnPool queue channel**: Removed the unused queue channel from `ConnPool`, trimming the pool's footprint ([#&#8203;3826](redis/go-redis#3826)) by [@&#8203;cxljs](https://github.com/cxljs)
- **Extra packages LICENSE**: Added a LICENSE file to each `extra/*` package ([#&#8203;3817](redis/go-redis#3817)) by [@&#8203;ndyakov](https://github.com/ndyakov)
- **README & CI image**: Documentation refresh and bumped the default CI image tag ([#&#8203;3822](redis/go-redis#3822)) by [@&#8203;ndyakov](https://github.com/ndyakov)

#### 👥 Contributors

We'd like to thank all the contributors who worked on this release!

[@&#8203;cxljs](https://github.com/cxljs), [@&#8203;DengY11](https://github.com/DengY11), [@&#8203;elena-kolevska](https://github.com/elena-kolevska), [@&#8203;Khukharr](https://github.com/Khukharr), [@&#8203;LINKIWI](https://github.com/LINKIWI), [@&#8203;ndyakov](https://github.com/ndyakov), [@&#8203;obiyang](https://github.com/obiyang), [@&#8203;ofekshenawa](https://github.com/ofekshenawa), [@&#8203;retr0-kernel](https://github.com/retr0-kernel), [@&#8203;romanpovol](https://github.com/romanpovol), [@&#8203;shahyash2609](https://github.com/shahyash2609), [@&#8203;solardome](https://github.com/solardome), [@&#8203;vladisa88](https://github.com/vladisa88)

***

**Full Changelog**: <redis/go-redis@v9.19.0...v9.20.0>

</details>

---

### Configuration

📅 **Schedule**: (UTC)

- Branch creation
  - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`)
- Automerge
  - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`)

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xOTUuMSIsInVwZGF0ZWRJblZlciI6IjQzLjE5NS4xIiwidGFyZ2V0QnJhbmNoIjoiZm9yZ2VqbyIsImxhYmVscyI6WyJkZXBlbmRlbmN5LXVwZ3JhZGUiLCJ0ZXN0L25vdC1uZWVkZWQiXX0=-->

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/12804
Reviewed-by: Mathieu Fenniak <mfenniak@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants