Use sharded pub/sub for Redis Cluster to fix duplicate message delivery by niemyjski · Pull Request #141 · FoundatioFx/Foundatio.Redis

niemyjski · 2026-02-14T00:52:12Z

Summary

Use sharded pub/sub (SPUBLISH/SSUBSCRIBE) in Redis Cluster mode to fix duplicate message delivery
Replace Lazy<RedisChannel> with RedisChannel? nullable struct for zero-allocation lazy caching
Drop Redis < 7.0 version check — everything below 7.2 is EOL

Problem

In Redis Cluster mode, PUBLISH broadcasts messages to all nodes. StackExchange.Redis spreads Literal subscriptions across nodes, so each subscriber receives the message once per primary node (3x in a typical 3-master cluster). This caused CanReceiveMessagesConcurrentlyAsync to fail with too many signals.

Solution

When IsCluster() is true, use RedisChannel.Sharded() instead of RedisChannel.Literal(). Sharded pub/sub routes all operations for a given channel through a single shard, ensuring exactly-once delivery while preserving full fanout to all subscribers on that channel.

The channel type is resolved lazily on first access and cached via RedisChannel? _channel with ??=. This is:

Zero-allocation: RedisChannel is a struct, Nullable<RedisChannel> is also a value type
Lazy: IsCluster() is only called on first subscribe/publish, not in the constructor
Consistent: Same pattern as RedisQueue._listPrefix which caches IsCluster() once and never resets (topology does not change on reconnect)
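
For readers who have not opened the diff, the selection and caching described above can be sketched roughly as follows. This is a minimal sketch, not the code from RedisMessageBus.cs: the `ChannelCache` class and the `Func<bool>` probe are hypothetical stand-ins for the message bus and its `IsCluster()` call, and it assumes the StackExchange.Redis package is referenced.

```csharp
using System;
using StackExchange.Redis;

// Hypothetical stand-in for the message bus; only the channel caching is shown.
public sealed class ChannelCache
{
    private readonly Func<bool> _isCluster; // stand-in for the IsCluster() probe
    private readonly string _topic;
    private RedisChannel? _channel;         // null until the first publish/subscribe

    public ChannelCache(Func<bool> isCluster, string topic)
    {
        _isCluster = isCluster;
        _topic = topic;
    }

    // Resolved on first access, then reused; the probe is not called again
    // once a value has been stored.
    public RedisChannel Channel => _channel ??= _isCluster()
        ? RedisChannel.Sharded(_topic)   // SPUBLISH / SSUBSCRIBE
        : RedisChannel.Literal(_topic);  // PUBLISH / SUBSCRIBE
}
```

Passing the probe as a delegate is only there to keep the sketch independent of a live connection; in the PR the check is made against the configured multiplexer.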

Test plan

All 23 messaging tests pass against Redis Cluster (7000-7005)
Build succeeds with zero errors
Only RedisMessageBus.cs changed — no test harness or config modifications

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

This PR fixes a critical duplicate message delivery bug in RedisMessageBus when deployed against Redis Cluster. The fix switches from standard pub/sub (PUBLISH/SUBSCRIBE) to sharded pub/sub (SPUBLISH/SSUBSCRIBE) for cluster deployments, ensuring exactly-once message delivery while preserving full fanout to all subscribers.

Changes:

Implemented cluster detection using the IsCluster() extension method, consistent with other Redis components in the codebase
Introduced GetChannel() method to lazily resolve the appropriate Redis channel type (sharded for clusters, literal for standalone/sentinel)
Replaced direct RedisChannel.Literal() calls with GetChannel() in both subscription and publish code paths

src/Foundatio.Redis/Messaging/RedisMessageBus.cs

Replace Lazy<RedisChannel> with RedisChannel? for zero-allocation lazy caching. Drop Redis version check since < 7.2 is EOL. The nullable struct is evaluated once on first access via ??= and cached for all subsequent calls, consistent with how RedisQueue caches IsCluster() in _listPrefix. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

src/Foundatio.Redis/Messaging/RedisMessageBus.cs:63

There are existing RedisMessageBus tests, but none appear to assert the cluster-specific behavior introduced here (no duplicates in cluster, sharded channel selection when cluster + Redis >= 7, and Literal fallback otherwise). Adding an integration test that detects duplicate delivery in cluster mode would help prevent regressions of this bug fix.

src/Foundatio.Redis/Messaging/RedisMessageBus.cs:62

ResolveChannel() falls back to Literal only when it finds a connected primary with Version < 7.0, but if no primaries are connected at the moment this Lazy is first evaluated, it will default to RedisChannel.Sharded() and cache that choice. That can cause permanent SPUBLISH/SSUBSCRIBE usage (and runtime errors) against pre-7.0 clusters if the initial version probe happened before connections were established. Consider handling the “no connected primaries / version unknown” case explicitly (e.g., return Literal or avoid caching until a version can be confirmed).

            return;

        using (await _lock.LockAsync().AnyContext())
        {
            if (_isSubscribed)
                return;

            _logger.LogTrace("Subscribing to topic: {Topic}", _options.Topic);
            _channelMessageQueue = await _options.Subscriber.SubscribeAsync(Channel).AnyContext();
            _channelMessageQueue.OnMessage(OnMessage);
            _isSubscribed = true;
            _logger.LogTrace("Subscribed to topic: {Topic}", _options.Topic);
        }
    }

src/Foundatio.Redis/Messaging/RedisMessageBus.cs

niemyjski · 2026-02-14T01:25:53Z

Verification: Bug is real and reproducible

Ran CanReceiveMessagesConcurrentlyAsync against the local Redis Cluster (3 primaries, 3 replicas, Redis 8.2.1, StackExchange.Redis 2.11.0):

| Code | Result | Time |
| --- | --- | --- |
| `RedisChannel.Literal` (original) | FAIL - timeout | 4.5s (hit 4s timeout, not all 1000 signals received) |
| `RedisChannel.Sharded` (this PR) | PASS | 0.447s |

Root Cause: Why Sharded is ~10x faster

The test publishes 103 messages to 10 subscribers (expecting 1000 countdown signals). Here is why the routing matters:

PUBLISH (Literal) in a 3-primary cluster:

StackExchange.Redis sends PUBLISH to an arbitrary node
That node broadcasts the message via the cluster bus to ALL other nodes (5 hops per message)
The subscription lives on a single node (StackExchange.Redis subscribes to one node only for Literal channels)
The subscription node receives the message from the cluster bus after propagation
103 publishes x 5 cluster bus forwards = ~515 inter-node messages
With CommandFlags.FireAndForget, there is no backpressure - the client fires all 103 publishes as fast as possible, saturating the cluster bus
Messages queue up in inter-node propagation, some arriving after the 4-second timeout

SPUBLISH (Sharded):

The channel name is hashed to a slot owned by exactly one shard
SSUBSCRIBE connects to that same shard
Each SPUBLISH routes directly to the owning shard - no cluster bus broadcast
103 publishes -> 103 direct deliveries to one node, zero inter-node traffic
All 1000 signals complete in under 500ms

Summary

The performance difference is architectural: PUBLISH has O(messages x nodes) cluster bus overhead, SPUBLISH has O(messages) with direct routing. Under rapid FireAndForget publishing, the cluster bus becomes the bottleneck, causing message delivery delays that exceed the test timeout.
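
The arithmetic behind that claim can be written down as a toy model. This is only a sketch of the reasoning in this comment, not a measurement: `ClusterBusModel` and its method names are hypothetical, and the node count of 6 assumes the 3 primaries plus 3 replicas of the test cluster.

```csharp
public static class ClusterBusModel
{
    // PUBLISH: the node that receives the command forwards the message
    // to every other node in the cluster over the cluster bus.
    public static int BroadcastForwards(int publishes, int nodes) =>
        publishes * (nodes - 1);

    // SPUBLISH: per the analysis above, the message goes straight to the
    // owning shard, so the model counts no cluster bus broadcast at all.
    public static int ShardedForwards(int publishes) => 0;
}
```

With the numbers from the test run, `ClusterBusModel.BroadcastForwards(103, 6)` gives the 515 inter-node messages quoted above, while the sharded path stays at zero in this model.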

References:

PUBLISH docs: "The cluster makes sure that published messages are forwarded as needed" (cluster bus broadcast)
SPUBLISH docs: "Posts a message to the given shard channel" (direct shard routing)
StackExchange.Redis #2750: Subscribe routes to single node in cluster mode

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

src/Foundatio.Redis/Messaging/RedisMessageBus.cs

…e condition

- Add IsRedisCluster() extension that checks for actual Redis Cluster (ServerType.Cluster) only, unlike IsCluster() which also returns true for Twemproxy and sentinel-only configs. Sharded pub/sub (SPUBLISH/SSUBSCRIBE) is a Redis Cluster-specific feature.
- Update Channel property to use IsRedisCluster() instead of IsCluster()
- Improve XML docs: clarify this prevents per-primary duplicate delivery, not end-to-end exactly-once semantics
- Add remarks explaining why ??= race is benign for readonly struct
- Fix RemoveIfEqualAsync_WithNonMatchingValue_DoesNotPublishInvalidation test race condition: SetAsync publishes an InvalidateCache message that can arrive at secondCache after the baseline counter is captured. Fixed by draining pending invalidations before populating secondCache.

Co-authored-by: Cursor <cursoragent@cursor.com>

niemyjski · 2026-02-14T01:42:02Z

Addressed Copilot Review Feedback + Fixed Build Failure

Copilot Review Comments Addressed

1. IsCluster() returns true for Twemproxy/Sentinel (valid concern)

Added new IsRedisCluster() extension method that checks specifically for ServerType.Cluster only
Unlike IsCluster() (which returns true for Twemproxy proxy and sentinel-only configs), IsRedisCluster() only returns true for actual Redis Cluster deployments where SPUBLISH/SSUBSCRIBE are supported
Updated Channel property to use IsRedisCluster() instead of IsCluster()

2. Thread safety of ??= with nullable struct

Added <remarks> explaining why the race is benign: RedisChannel is a readonly struct, and both concurrent initializations produce identical values. The worst case is calling IsRedisCluster() twice, which is harmless.
This follows the same pattern as _listPrefix in RedisQueue which caches IsCluster() results without locking.

3. XML doc "exactly-once" is misleading

Updated docs to say "preventing per-primary duplicate delivery" instead of "ensuring exactly-once delivery"
Added explicit caveat about not providing end-to-end exactly-once semantics
Updated fallback description to include proxy deployments

Build Failure RCA: `RemoveIfEqualAsync_WithNonMatchingValue_DoesNotPublishInvalidation`

Root Cause: Race condition in the base HybridCacheClientTestBase test (Foundatio.TestHarness 13.0.0-beta1.17)

The bug sequence:

firstCache.SetAsync(key, "value") — publishes InvalidateCache message via Redis pub/sub
secondCache.GetAsync(key) — reads value, populates local cache
initialInvalidateCalls = secondCache.InvalidateCacheCalls — captures counter
The InvalidateCache from step 1 arrives at secondCache after step 3, incrementing the counter and clearing the local cache
Assertion fails: expected initialInvalidateCalls == secondCache.InvalidateCacheCalls but it's initialInvalidateCalls + 1

This is a pre-existing bug, NOT caused by this PR. Verified by reverting to the original RedisChannel.Literal code — the test still fails 100% of the time (3/3 runs). The race existed before but was never caught because the update deps commit on main (which introduced these tests via 13.0.0-beta1.17) had its CI run cancelled.

Fix: Override the test in both RedisHybridCacheClientTests and ScopedRedisHybridCacheClientTests to drain pending invalidations before populating secondCache's local cache. Also fixed upstream in Foundatio TestHarness source.
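
The PR text does not show how the pending invalidations are drained, so the following is only one possible shape for such a step, under stated assumptions: `TestSync.WaitUntilQuietAsync` is a hypothetical helper, and "drained" is approximated as "the counter has not changed for a short quiet window".

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class TestSync
{
    // Polls a counter until it has stayed unchanged for quietPeriod,
    // or until timeout elapses. Returns the last observed value.
    public static async Task<long> WaitUntilQuietAsync(
        Func<long> readCounter, TimeSpan quietPeriod, TimeSpan timeout)
    {
        var total = Stopwatch.StartNew();
        var quiet = Stopwatch.StartNew();
        long last = readCounter();

        while (quiet.Elapsed < quietPeriod && total.Elapsed < timeout)
        {
            await Task.Delay(10);
            long current = readCounter();
            if (current != last)
            {
                last = current;
                quiet.Restart(); // a late message arrived; start the quiet window over
            }
        }

        return last;
    }
}
```

In the sequence above, a call such as `await TestSync.WaitUntilQuietAsync(() => secondCache.InvalidateCacheCalls, ...)` would sit between steps 1 and 2, so the baseline captured in step 3 already includes the invalidation from `SetAsync`. A quiet window is a heuristic rather than a guarantee, which is worth keeping in mind if the test still flakes under load.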

src/Foundatio.Redis/Extensions/RedisExtensions.cs

… iteration

- RedisCacheClient: add _isCluster field resolved in constructor, replacing 6 per-call IsCluster() checks on hot paths (RemoveAll, GetAll, SetAll, etc.)
- RedisQueue: consolidate IsCluster() into single constructor local for _listPrefix and _topicChannel
- RedisMessageBus: standardize on IsCluster(), remove IsRedisCluster(), lazy ??= for channel caching
- Remove IsRedisCluster() extension to keep single API
- Revert test overrides back to base class delegation

Co-authored-by: Cursor <cursoragent@cursor.com>

mgravell · 2026-02-14T16:41:10Z

I am not familiar with this lib, so I don't have much context: but, RedisChannel.Literal optionally supports key-like routing, which gives you the same node-routing as the sharded variants without changing the message kind - .WithKeyRouting(). The servers will still broadcast horizontally but effectively one node will be dealing with all the traffic.
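
For anyone following along, the alternative suggested here would presumably look something like the sketch below. It assumes a StackExchange.Redis version that ships `WithKeyRouting()` as named in this comment and that it is an instance method returning a `RedisChannel`; the `ChannelOptions` wrapper is hypothetical.

```csharp
using StackExchange.Redis;

public static class ChannelOptions
{
    // Keeps plain PUBLISH/SUBSCRIBE semantics but asks the client to pick the
    // node from the channel name, as a key would be routed. Assumes the
    // installed StackExchange.Redis version exposes WithKeyRouting().
    public static RedisChannel RoutedLiteral(string topic) =>
        RedisChannel.Literal(topic).WithKeyRouting();
}
```

Unlike `RedisChannel.Sharded`, this does not change the commands sent to the server, so it would also work against deployments that do not support SPUBLISH/SSUBSCRIBE.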

If you're seeing a correctness bug: something is wrong. Literal should only connect to one node (whether random or key-like), unless something went very weird with the changes we made for the keyspace/keyevent notifications.

As a side note: if this is dealing with hybrid-cache via SE.Redis... I'm kinda amazed I haven't seen it before 🙃

mgravell · 2026-02-14T17:18:37Z

Observations on RedisChannel?:

IIRC there is an IsNull check on a channel; you could potentially use that to avoid the extra layer of wrapping
beware "tearing" if this is accessed from multiple threads
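
One way to sidestep the tearing concern, sketched under assumptions: publish the struct through a single reference write instead of storing a multi-word `Nullable<T>` field directly. `TearFreeCache<T>` is a hypothetical helper, not something from this PR or from StackExchange.Redis, and it costs one small allocation, which trades against the zero-allocation goal stated in the description.

```csharp
using System;
using System.Threading;

public sealed class TearFreeCache<T> where T : struct
{
    // Immutable holder; readers only ever see a fully constructed instance.
    private sealed class Box
    {
        public readonly T Value;
        public Box(T value) => Value = value;
    }

    private Box? _box; // reference reads and writes are atomic in .NET

    public T GetOrCreate(Func<T> factory)
    {
        var box = Volatile.Read(ref _box);
        if (box is null)
        {
            var candidate = new Box(factory());
            // First writer wins; a losing thread adopts the published box.
            box = Interlocked.CompareExchange(ref _box, candidate, null) ?? candidate;
        }
        return box.Value;
    }
}
```

Under contention the factory may still run more than once, which matches the "worst case is calling the check twice" reasoning earlier in the thread, but every caller ends up with the same stored value.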

niemyjski force-pushed the fix/sharded-pubsub-cluster branch from f001533 to 7527190 Compare February 14, 2026 00:57

niemyjski requested a review from Copilot February 14, 2026 00:59

niemyjski self-assigned this Feb 14, 2026

niemyjski added the bug label Feb 14, 2026

niemyjski requested review from ejsmith, randylsu, ttugrad01 and twehner February 14, 2026 00:59

Copilot started reviewing on behalf of niemyjski February 14, 2026 00:59

Use sharded pub/sub for Redis Cluster to fix duplicate message delivery

19c1e47

Co-authored-by: Cursor <cursoragent@cursor.com>

niemyjski force-pushed the fix/sharded-pubsub-cluster branch from 7527190 to 19c1e47 Compare February 14, 2026 01:00

Copilot AI reviewed Feb 14, 2026

niemyjski requested a review from Copilot February 14, 2026 01:09

Copilot started reviewing on behalf of niemyjski February 14, 2026 01:10

Copilot AI reviewed Feb 14, 2026

niemyjski requested a review from Copilot February 14, 2026 01:27

Copilot started reviewing on behalf of niemyjski February 14, 2026 01:28

Copilot AI reviewed Feb 14, 2026

github-code-quality bot found potential problems Feb 14, 2026

niemyjski merged commit 564e7dc into main Feb 14, 2026
4 checks passed

niemyjski deleted the fix/sharded-pubsub-cluster branch February 14, 2026 02:32

niemyjski merged 4 commits into main from fix/sharded-pubsub-cluster