fix(test): make waitForSentinelClusterStable robust to disconnected r…#3830
Merged
ndyakov merged 5 commits on May 27, 2026
…eplicas
CI on master sporadically fails the universal_test.go spec
"should connect to failover servers on slaves when readonly Options is ok"
with `Expected master to equal slave`. The flake reproduces only when the
random spec order puts the sentinel-failover Describe block before the
universal_test spec.
Root cause is in waitForSentinelClusterStable's post-conditions, not in
production code:
1. It only checks `len(replicas) >= 2`. Production
(parseReplicaAddrs in sentinel.go) filters out replicas flagged
`s_down`, `o_down`, or `disconnected`. Post-failover, replicas
often linger as `disconnected` for several seconds after Sentinel
agrees on the master/replica counts, so the stabilization check
can pass while production sees zero usable replicas.
2. It then waits a fixed 10s. With no link between that sleep and
the actual precondition (a FailoverClient(ReadOnly:true) finding
a routable replica), the test gambles that 10s is enough on every
CI host. When it isn't, RandomReplicaAddr silently returns the
master and the spec asserts on the wrong role.
Two changes:
- Add countUsableReplicas, mirroring parseReplicaAddrs's filter, and
require each sentinel to report at least 2 usable (non-down,
non-disconnected) replicas with consistent counts across sentinels.
- Replace the 10s sleep with a bounded Eventually that opens the same
kind of client the failing spec uses (UniversalClient + MasterName +
ReadOnly:true) and confirms ROLE returns "slave". This is the
deterministic equivalent of waiting "long enough" — it returns as
soon as the actual precondition holds, and bounds at 30s if the
cluster never stabilizes.
No production code changes; the flake is entirely in test harness
post-conditions. The new probe uses only public API and existing
package-level fixtures (sentinelName, sentinelAddrs).
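
The usable-replica filter described above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's actual `countUsableReplicas`: the `replicaInfo` type, the `isUsable` and `countUsable` names, and the sample addresses are hypothetical, and each replica is assumed to be a field map with a comma-separated `flags` entry, as in a Sentinel replicas reply.

```go
package main

import (
	"fmt"
	"strings"
)

// replicaInfo is a hypothetical stand-in for one replica entry reported by
// Sentinel, flattened into a field-name -> value map.
type replicaInfo map[string]string

// isUsable reports whether a replica would survive the kind of filtering the
// PR attributes to production code: it must have an address and must not be
// flagged s_down, o_down, or disconnected.
func isUsable(r replicaInfo) bool {
	if r["ip"] == "" || r["port"] == "" {
		return false
	}
	for _, flag := range strings.Split(r["flags"], ",") {
		switch flag {
		case "s_down", "o_down", "disconnected":
			return false
		}
	}
	return true
}

// countUsable counts the replicas that pass isUsable.
func countUsable(replicas []replicaInfo) int {
	n := 0
	for _, r := range replicas {
		if isUsable(r) {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative data only: three replicas are listed, but two are flagged.
	replicas := []replicaInfo{
		{"ip": "127.0.0.1", "port": "7001", "flags": "slave"},
		{"ip": "127.0.0.1", "port": "7002", "flags": "slave,disconnected"},
		{"ip": "127.0.0.1", "port": "7003", "flags": "slave,s_down"},
	}
	// A raw length check and a usable count can disagree.
	fmt.Println(len(replicas), countUsable(replicas))
}
```

In this sample a `len(replicas) >= 2` check passes while only one replica is actually routable, which is the gap the stricter post-condition is meant to close.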
Pull request overview
Hardens the Sentinel test harness’ waitForSentinelClusterStable to prevent CI flakes where a ReadOnly failover client temporarily routes to the master because Sentinels still report replicas as disconnected even after master/replica counts look “stable”.
Changes:
- Add `countUsableReplicas` to count only replicas that would be considered usable by production filtering (`s_down`, `o_down`, `disconnected`, missing `ip`/`port`).
- Strengthen stabilization criteria to require each sentinel to report at least 2 usable replicas with consistent usable counts.
- Replace a fixed `time.Sleep(10s)` with a bounded `Eventually` probe that creates a `UniversalClient` with `ReadOnly: true` and verifies `ROLE` reports `slave`.
ofekshenawa
previously approved these changes
May 27, 2026
CI on master sporadically fails the universal_test.go spec
"should connect to failover servers on slaves when readonly Options is ok"
with `Expected master to equal slave`. Two random-spec-ordering bug
paths combine to produce this:
1. sentinel_test.go has three failover-causing specs across three
Describe blocks ("Sentinel", "NewFailoverClusterClient", and
"Sentinel Failover with Conns"). Only the third runs Ordered/Serial
with an AfterAll that calls waitForSentinelClusterStable. The
first two can land right before the universal_test spec under any
random spec ordering and leave the cluster mid-recovery.
2. FailoverClient(ReadOnly:true) routes via RandomReplicaAddr, which
silently falls back to the master when Sentinel reports zero
usable replicas. parseReplicaAddrs in sentinel.go filters out
replicas flagged s_down, o_down, or disconnected — and freshly
promoted/demoted replicas often linger as "disconnected" for
several seconds after Sentinel agrees on master/replica counts.
Two test-only changes (no production code touched):
a) waitForSentinelClusterStable: previously only checked
len(replicas) >= 2, which can pass while every replica is
filtered out by production code. Now mirrors parseReplicaAddrs's
filter via countUsableReplicas and requires each sentinel to
report >= 2 usable replicas with consistent counts across
sentinels. The fixed 10s sleep is replaced with a bounded
Eventually that opens the same UniversalClient(ReadOnly:true)
the failing spec uses and confirms ROLE returns "slave" — the
deterministic equivalent of "wait long enough".
b) universal_test.go spec: defense in depth. Since the failover
specs in (1) don't all share an AfterAll, the spec itself
retries client creation + ROLE in a bounded Eventually. Each
iteration closes the old client and creates a new one so
RandomReplicaAddr is re-evaluated against current Sentinel
state. Bounded at 30s so a genuinely broken setup still fails.
The (a) fix keeps the post-failover state clean for any downstream
test; the (b) fix keeps this specific spec correct regardless of
ordering. Together they make the spec deterministic across all 1414!
possible random spec orderings.
Previous attempt fixed the ROLE check ("Expected master to equal slave")
by retrying client creation inside Eventually until ROLE returned
"slave". CI surfaced a second failure mode in the same spec:
[FAILED] Expected an error to have occurred. Got: <nil>: nil
universal_test.go:110
The spec's intent was: ReadOnly client lands on a replica, and a write
against that replica returns READONLY. The new failure means the write
succeeded — there's a second path back to the master:
SET on replica → READONLY error
→ shouldRetry returns true for READONLY (error.go)
→ conn discarded, retried via masterReplicaDialer
→ RandomReplicaAddr called again, may silently fall back to master
when no usable replica is reported (same root cause as the original
flake — Sentinel sees 2 replicas but both are "disconnected")
→ SET hits the master, succeeds, err is nil
→ Expect(err).To(HaveOccurred()) fails
Two changes to make the spec atomic and bypass-proof:
- Set MaxRetries: -1 on the test client. A READONLY reply now
surfaces as the first/only SET error instead of triggering a
retry-dial that may pick the master. This isolates what the test
is actually asserting: a single round-trip to whatever node the
dialer picked.
- Move the SET assertion inside the Eventually closure together
with the ROLE check. If either step lands on the master (because
RandomReplicaAddr fell back at dial time), the closure returns an
error, Eventually closes the client and retries with a fresh one
so the dialer re-queries Sentinel. The full assertion is therefore
atomic — both ROLE and SET observe the same dial.
Error messages from the closure name the precise failure mode so a
genuinely broken setup (timeout after 30s) reports something more
useful than "expected slave, got master".
No production code changes; the retry-on-READONLY behaviour is correct
for normal failover scenarios and stays untouched.
ofekshenawa
approved these changes
May 27, 2026
saschazepter
pushed a commit
to saschazepter/forgejo
that referenced
this pull request
May 29, 2026
This PR contains the following updates: | Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) | |---|---|---|---| | [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) | `v9.19.0` → `v9.20.0` |  |  | --- ### Release Notes <details> <summary>redis/go-redis (github.com/redis/go-redis/v9)</summary> ### [`v9.20.0`](https://github.com/redis/go-redis/releases/tag/v9.20.0): 9.20.0 [Compare Source](redis/go-redis@v9.19.0...v9.20.0) #### 🚀 Highlights ##### Redis 8.8 Support This release adds support for **Redis 8.8**. The README's supported-versions list now includes Redis 8.8 alongside 8.0/8.2/8.4, and CI exercises the `8.8` client-libs-test image across the full suite (Makefile, build workflow, doctests, run-tests action, and docker-compose). Coverage for the new commands that ship in the 8.x line, rounded out in this release: - **`AR*` array data type** ([#​3813](redis/go-redis#3813)) — new array data structure, exposed via the `ArrayCmdable` interface (see the experimental-features highlight below). - **`INCREX`** ([#​3816](redis/go-redis#3816)) — atomic increment with expiration in a single round-trip. - **`XNACK`** ([#​3790](redis/go-redis#3790)) — explicit negative-acknowledge of pending stream entries. - **`XAUTOCLAIM` PEL deletes** ([#​3798](redis/go-redis#3798)) — `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the pending entries list. - **`TS.RANGE` multiple aggregators** ([#​3791](redis/go-redis#3791)) — `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` accept multiple aggregators in a single call. - **`Z(UNION|INTER|DIFF)` `COUNT` aggregator** ([#​3802](redis/go-redis#3802)) — `COUNT` reducer for sorted-set set operations. - **`JSON.SET FPHA`** ([#​3797](redis/go-redis#3797)) — new `FPHA` argument that specifies the floating-point type for homogeneous FP arrays. 
CI image bump ([#​3814](redis/go-redis#3814)) by [@​ofekshenawa](https://github.com/ofekshenawa). Command coverage contributions by [@​cxljs](https://github.com/cxljs), [@​elena-kolevska](https://github.com/elena-kolevska), [@​Khukharr](https://github.com/Khukharr), [@​ndyakov](https://github.com/ndyakov), and [@​ofekshenawa](https://github.com/ofekshenawa). ##### Stable RESP3 for RediSearch (`UnstableResp3` deprecated) `FT.SEARCH`, `FT.AGGREGATE`, `FT.INFO`, `FT.SPELLCHECK`, and `FT.SYNDUMP` now parse RESP3 (map) responses into the same typed result objects as RESP2 — `Val()` and `Result()` work uniformly on both protocols, no flag required. Previously, RESP3 search responses required `UnstableResp3: true` and were returned as opaque maps accessible only via `RawResult()` / `RawVal()`. As a result, the `UnstableResp3` option is now a **no-op** across every options struct (`Options`, `ClusterOptions`, `UniversalOptions`, `FailoverOptions`, `RingOptions`) and has been marked `// Deprecated:`. The field is retained for backwards compatibility — existing code that sets `UnstableResp3: true` will continue to compile and behave identically — but it will be removed in a future release and new code should not set it. `RawResult()` / `RawVal()` continue to work for callers that prefer the raw RESP payload. ([#​3741](redis/go-redis#3741)) by [@​ndyakov](https://github.com/ndyakov) ##### Experimental Array Data Structure Commands Adds an experimental `ArrayCmdable` interface with the `AR*` command family (`ARSet`, `ARGet`, `ARGetRange`, `ARMSet`, `ARMGet`, `ARDel`, `ARDelRange`, `ARScan`, `ARSeek`, `ARNext`, `ARLastItems`, `ARGrep`, `ARGrepWithValues`, `ARInfo`/`ARInfoFull`, and typed reducers `AROpSum`/`AROpMin`/`AROpMax`/`AROpAnd`/`AROpOr`/`AROpXor`/`AROpMatch`/`AROpUsed`) for working with Redis 8.8's new array data type. 
**API is experimental and may change in a future release.** ([#​3813](redis/go-redis#3813)) by [@​cxljs](https://github.com/cxljs) #### ✨ New Features - **RESP3 search parser**: First-class RESP3 parsing for `FT.SEARCH`/`FT.AGGREGATE`/`FT.INFO`/`FT.SPELLCHECK`/`FT.SYNDUMP` responses with backwards compatibility for RESP2 ([#​3741](redis/go-redis#3741)) by [@​ndyakov](https://github.com/ndyakov) - **INCREX**: New `INCREX` command support — atomic increment with expiration ([#​3816](redis/go-redis#3816)) by [@​ndyakov](https://github.com/ndyakov) - **XNACK**: Client support for the `XNACK` stream command for explicitly negative-acknowledging pending entries ([#​3790](redis/go-redis#3790)) by [@​elena-kolevska](https://github.com/elena-kolevska) - **TS range multiple aggregators**: `TS.RANGE`/`TS.REVRANGE`/`TS.MRANGE`/`TS.MREVRANGE` now accept multiple aggregators in a single call ([#​3791](redis/go-redis#3791)) by [@​elena-kolevska](https://github.com/elena-kolevska) - **`XAutoClaim` deleted IDs**: `XAUTOCLAIM`/`XAUTOCLAIMJUSTID` now return the list of deleted message IDs from the PEL ([#​3798](redis/go-redis#3798)) by [@​Khukharr](https://github.com/Khukharr) - **`JSON.SET FPHA`**: `JSON.SET` accepts a new `FPHA` argument that specifies the floating-point type for homogeneous floating-point arrays ([#​3797](redis/go-redis#3797)) by [@​ndyakov](https://github.com/ndyakov) - **Sorted-set union/intersection COUNT**: `ZUNION`/`ZINTER`/`ZDIFF` aggregator now supports `COUNT` ([#​3802](redis/go-redis#3802)) by [@​ofekshenawa](https://github.com/ofekshenawa) - **`FT.HYBRID` vector validation**: Validates hybrid-search vector input types and adds proper typed vector parameters ([#​3756](redis/go-redis#3756)) by [@​DengY11](https://github.com/DengY11) - **Cluster pool wait stats**: `ClusterClient.PoolStats()` now accumulates `WaitCount` and `WaitDurationNs` across all node pools (previously always zero) ([#​3809](redis/go-redis#3809)) by 
[@​LINKIWI](https://github.com/LINKIWI) #### 🐛 Bug Fixes - **TLS-only Cluster PubSub**: `CLUSTER SLOTS` port-0 entries now fall back to the origin endpoint's port, fixing `dial tcp <ip>:0: connection refused` on TLS-only clusters started with `--port 0 --tls-port <port>` (fixes [#​3726](redis/go-redis#3726)) ([#​3828](redis/go-redis#3828)) by [@​ndyakov](https://github.com/ndyakov) - **Sharded PubSub reconnect routing**: `PubSub.conn()` now passes both regular (`c.channels`) and sharded (`c.schannels`) channels into the per-PubSub `newConn` closure. Previously, `ClusterClient.SSubscribe`-only PubSubs reconnected to a random node (because the routing closure saw an empty channel list), the `SSUBSCRIBE` was sent to the wrong shard, and the resulting `MOVED` reply was silently dropped ([#​3829](redis/go-redis#3829)) by [@​ndyakov](https://github.com/ndyakov) - **ClusterClient `Watch` retry**: User errors returned from a `Watch` callback are no longer subjected to cluster-retry classification; transient cluster errors still retry, but a callback returning e.g. 
`net.ErrClosed` short-circuits immediately ([#​3821](redis/go-redis#3821)) by [@​obiyang](https://github.com/obiyang) - **Sentinel concurrent-probe leak**: `MasterAddr`'s concurrent sentinel probe now closes the non-winning sentinel clients instead of leaking them ([#​3827](redis/go-redis#3827)) by [@​cxljs](https://github.com/cxljs) - **Sentinel rediscovery loop on master-only setups**: `replicaAddrs` no longer tears down the cached sentinel client when the replica list is empty, eliminating a continuous rediscovery loop on master-only Sentinel deployments that flooded logs and added per-operation latency ([#​3795](redis/go-redis#3795)) by [@​shahyash2609](https://github.com/shahyash2609) - **Pool `CloseConn` hooks**: `Pool.CloseConn` now triggers registered hooks, fixing a memory leak when connections are closed explicitly rather than via the normal removal path ([#​3818](redis/go-redis#3818)) by [@​ndyakov](https://github.com/ndyakov) - **Dial TCP error redirection**: Wrapped `dial tcp` errors are now correctly classified as redirectable so cluster routing can recover from a single unreachable node ([#​3810](redis/go-redis#3810)) by [@​vladisa88](https://github.com/vladisa88) - **Pool `Close` health checks**: `ConnPool.Close` now only runs health checks against idle connections, avoiding spurious activity on connections still in use ([#​3805](redis/go-redis#3805)) by [@​ndyakov](https://github.com/ndyakov) - **VLinks return type**: Fixed the return type of `VLINKS`/`VLINKSWITHSCORES` vector-set replies ([#​3820](redis/go-redis#3820)) by [@​romanpovol](https://github.com/romanpovol) #### 🧪 Testing & Infrastructure - **Flaky tests**: Stabilized several flaky tests in the sentinel and pool suites ([#​3815](redis/go-redis#3815)) by [@​ndyakov](https://github.com/ndyakov) - **Sentinel failover metric race**: Fixed a data race in the sentinel failover metric test ([#​3824](redis/go-redis#3824)) by [@​cxljs](https://github.com/cxljs) - **`waitForSentinelClusterStable` 
post-conditions**: The sentinel test harness now waits for replicas to be fully connected (not just present in the count) and is robust to randomized spec ordering after failover specs, eliminating an intermittent `Expected master to equal slave` flake ([#​3830](redis/go-redis#3830)) by [@​ndyakov](https://github.com/ndyakov) - **`govulncheck` workflow**: New scheduled GitHub Actions workflow runs `govulncheck` on every push, PR, and weekly, surfacing newly disclosed Go vulnerabilities even when no code changes ([#​3779](redis/go-redis#3779)) by [@​solardome](https://github.com/solardome) - **CI Redis 8.8-rc1**: CI now exercises the 8.8-rc1 Redis image ([#​3814](redis/go-redis#3814)) by [@​ofekshenawa](https://github.com/ofekshenawa) #### 🧰 Maintenance - **`Cmd.Slot()` lookup refactor**: Caches the per-command `CommandInfo` and short-circuits keyless commands before the switch dispatch, removing redundant `Peek` calls ([#​3804](redis/go-redis#3804)) by [@​retr0-kernel](https://github.com/retr0-kernel) - **stdlib `math/rand`**: Replaced `internal/rand` with `math/rand` from the standard library now that the minimum Go version is 1.24 ([#​3823](redis/go-redis#3823)) by [@​cxljs](https://github.com/cxljs) - **ConnPool queue channel**: Removed the unused queue channel from `ConnPool`, trimming the pool's footprint ([#​3826](redis/go-redis#3826)) by [@​cxljs](https://github.com/cxljs) - **Extra packages LICENSE**: Added a LICENSE file to each `extra/*` package ([#​3817](redis/go-redis#3817)) by [@​ndyakov](https://github.com/ndyakov) - **README & CI image**: Documentation refresh and bumped the default CI image tag ([#​3822](redis/go-redis#3822)) by [@​ndyakov](https://github.com/ndyakov) #### 👥 Contributors We'd like to thank all the contributors who worked on this release! 
[@​cxljs](https://github.com/cxljs), [@​DengY11](https://github.com/DengY11), [@​elena-kolevska](https://github.com/elena-kolevska), [@​Khukharr](https://github.com/Khukharr), [@​LINKIWI](https://github.com/LINKIWI), [@​ndyakov](https://github.com/ndyakov), [@​obiyang](https://github.com/obiyang), [@​ofekshenawa](https://github.com/ofekshenawa), [@​retr0-kernel](https://github.com/retr0-kernel), [@​romanpovol](https://github.com/romanpovol), [@​shahyash2609](https://github.com/shahyash2609), [@​solardome](https://github.com/solardome), [@​vladisa88](https://github.com/vladisa88) *** **Full Changelog**: <redis/go-redis@v9.19.0...v9.20.0> </details> --- ### Configuration 📅 **Schedule**: (UTC) - Branch creation - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`) - Automerge - Between 12:00 AM and 03:59 AM (`* 0-3 * * *`) 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xOTUuMSIsInVwZGF0ZWRJblZlciI6IjQzLjE5NS4xIiwidGFyZ2V0QnJhbmNoIjoiZm9yZ2VqbyIsImxhYmVscyI6WyJkZXBlbmRlbmN5LXVwZ3JhZGUiLCJ0ZXN0L25vdC1uZWVkZWQiXX0=--> Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/12804 Reviewed-by: Mathieu Fenniak <mfenniak@noreply.codeberg.org>
Note
Low Risk
Changes are limited to test harness timing and assertions; no production Sentinel or client routing code is modified.
Overview
Hardens Sentinel integration tests so read-only failover specs stop flaking after earlier failover specs run in random order.
In `waitForSentinelClusterStable`, replica readiness now matches production: a new `countUsableReplicas` helper (same filtering as `parseReplicaAddrs`) replaces raw `len(replicas)`, and the fixed 10s sleep is replaced by a bounded `Eventually` that opens a `UniversalClient` with `ReadOnly: true` and waits until `ROLE` is `slave`.
The `universal_test` read-only spec is wrapped in `Eventually` with `MaxRetries: -1`, recreating the client on each attempt so `RandomReplicaAddr` master fallback and READONLY retry behavior cannot mask a bad replica route.
Reviewed by Cursor Bugbot for commit e7f16bd.