[FIX][AUTH]: Cross-replica cache invalidation for auth caches#3309
[FIX][AUTH]: Cross-replica cache invalidation for auth caches#3309crivetimihai merged 8 commits intomainfrom
Conversation
|
Thanks @marekdano. Critical fix for multi-replica deployments — the channel mismatch ( |
c38981b to
ecbf5c3
Compare
jonpspri
left a comment
There was a problem hiding this comment.
Review: Approve
Reviewed the full diff (6 commits after rebase onto main), ran all linters and tests.
What was checked
| Check | Result |
|---|---|
| Black / isort | Clean |
| flake8 | Clean |
| pylint | 10.00/10 |
| bandit | 0 issues |
| Unit tests (49) | All pass |
Assessment
Core fix is correct. AuthCache publishes to mcpgw:auth:invalidate but the subscriber only listened on mcpgw:cache:invalidate — the channel mismatch meant auth invalidations never propagated across replicas. Subscribing to both channels and adding the 7 auth message handlers resolves the gap.
Security:
- Channel-origin guard ensures auth-prefixed messages are only accepted from
mcpgw:auth:invalidate, preventing spoofing via the cache channel. - Final commit closes a subtle bypass where
channel=""(empty/falsy) would short-circuit the guard — now rejects unless channel is exactlymcpgw:auth:invalidate. _MAX_REVOKED_JTIScap (100K) prevents unbounded memory growth from pubsub flooding.
Correctness:
- All 7 message types published by
auth_cache(user:,revoke:,team:,role:,team_roles:,teams:,membership:) have matching handlers. _AUTH_PREFIXEStuple ordersteam_roles:andteams:beforeteam:to prevent prefix collision instartswith().threading.Lockusage in sync_process_auth_invalidationcalled from async context is safe — only fast dict ops, no await while held.
Test coverage:
- 11 new tests covering all 7 handler types, channel guard deny/accept paths,
_MAX_REVOKED_JTIScap, prefix collision regressions, empty identifiers, and identifiers with colons. - End-to-end cross-replica regression test included.
ecbf5c3 to
6ab1700
Compare
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
- Extract _process_auth_invalidation() from _process_invalidation() to reduce method complexity (pylint too-many-statements/locals) - Add _evict_keys() static helper to deduplicate key-eviction loops - Remove dead _channel backward-compat attribute - Run Black on test file to fix pre-existing formatting violations Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
- Cap _revoked_jtis at 100k entries to prevent unbounded memory growth from compromised Redis pubsub channels (DoS mitigation) - Fix FakePubSub.subscribe signatures in two exception-path tests to accept varargs, matching the production *self._channels call - Add prefix-collision regression tests proving teams:/team_roles: messages don't fall through to the team: handler - Add edge-case tests for empty identifiers, colon-containing identifiers, and unusual email formats Signed-off-by: Jonathan Springer <jps@s390x.com>
Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are now only accepted when they arrive on the mcpgw:auth:invalidate channel. Messages with auth prefixes on other channels are logged and ignored, preventing cross-channel confusion. Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct method name (_load_revoked_jtis in auth_cache). Signed-off-by: Jonathan Springer <jps@s390x.com>
The previous guard `if channel and channel != "mcpgw:auth:invalidate"` short-circuited to False when channel was empty, allowing auth messages to be processed without channel verification. Changed to `if channel != "mcpgw:auth:invalidate"` so auth messages are only accepted from the correct channel regardless of how the method is called. Signed-off-by: Jonathan Springer <jps@s390x.com>
Add 3 tests that exercise the full _listen_loop -> _process_invalidation path with channel data from fake pubsub messages: - test_listen_loop_forwards_channel_to_process_invalidation: verifies channel kwarg is forwarded from pubsub message to _process_invalidation - test_listen_loop_auth_message_evicts_cache_end_to_end: full path from pubsub message through auth cache eviction - test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end: deny-path proving channel guard works through the listener Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the reconciliation path (Redis L2 fallback on L1 miss, not the unused sync_revoked_tokens method). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The if/elif chain comment said "Order matters: team_roles/teams must be checked before team" but there is no actual startswith() prefix collision between these prefixes (teams: != team:, team_roles: != team:). Reword the comment to describe correct dispatch and rename the two tests from "prefix_collision" to "dispatches_to_correct_handler" so they describe what they actually verify. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
9a4b78c to
423a78f
Compare
crivetimihai
left a comment
There was a problem hiding this comment.
Review: Approve
Rebased onto current main (clean, no conflicts), reviewed the full implementation, ran linting, tests, and verified on a live 3-replica deployment with Redis.
Verification
| Check | Result |
|---|---|
| Black / isort / flake8 / bandit | Clean |
| Unit tests (52, up from 49) | All pass |
| Full cache module tests (440) | All pass |
| Live deployment (3 replicas + Redis + nginx LB) | All endpoints healthy |
Redis pubsub mcpgw:auth:invalidate subscribers |
72 (matches mcpgw:cache:invalidate) |
| All 7 auth invalidation message types via Redis | 72 subscribers received each |
| Channel-origin guard (auth msgs on wrong channel) | Correctly rejected |
| Admin UI (login, dashboard, tools, tokens) | Working |
Changes made during review (2 commits)
1. test: add listener-path regression tests for channel forwarding
Added 3 tests that exercise the full _listen_loop → _process_invalidation path with channel data from fake pubsub messages, closing a coverage gap identified during review:
test_listen_loop_forwards_channel_to_process_invalidation— verifies channel kwarg is forwarded from pubsub messagetest_listen_loop_auth_message_evicts_cache_end_to_end— full path from pubsub message through auth cache evictiontest_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end— deny-path proving channel guard works through the listener
Also fixed _MAX_REVOKED_JTIS docstring to accurately describe the reconciliation path (Redis L2 fallback on L1 miss via is_token_revoked() / get_auth_context(), not the unused sync_revoked_tokens() method).
2. refactor: clarify auth dispatch comment and test names
The comment "Order matters: team_roles/teams must be checked before team" was misleading — there is no actual startswith() prefix collision between teams:, team_roles:, and team: (the 5th character differs: s/_ vs :). Reworded the comment and renamed two test methods from prefix_collision to dispatches_to_correct_handler.
Assessment
- Core fix is correct. All 7 publish calls in
auth_cache.pyhave matching handlers that replicate the exact L1 eviction logic. - Security is sound. Channel-origin guard rejects auth messages on wrong channel; empty/missing channel fails closed;
_MAX_REVOKED_JTIScaps memory growth. - Thread safety matches existing patterns. Short lock scopes with no
awaitunder lock. - Test coverage is comprehensive. 52 tests including listener-path, channel guard deny/accept, cap enforcement, dispatch correctness, edge cases, and end-to-end regression.
Note for follow-up
invalidate_team() in auth_cache.py:508 does _team_cache.pop(email, None) with a bare email key, but production code writes _team_cache keys as f"{user_email}:{sorted_teams}". This means the pop never matches a real entry. This is pre-existing behavior (not introduced by this PR) and the subscriber correctly mirrors it. Worth a separate fix.
* fix: cross-replica cache invalidation for auth caches Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * refactor: extract auth invalidation into helper and fix formatting - Extract _process_auth_invalidation() from _process_invalidation() to reduce method complexity (pylint too-many-statements/locals) - Add _evict_keys() static helper to deduplicate key-eviction loops - Remove dead _channel backward-compat attribute - Run Black on test file to fix pre-existing formatting violations Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: add missing docstring params in _evict_keys Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: harden cross-replica revoke handler and improve test coverage - Cap _revoked_jtis at 100k entries to prevent unbounded memory growth from compromised Redis pubsub channels (DoS mitigation) - Fix FakePubSub.subscribe signatures in two exception-path tests to accept varargs, matching the production *self._channels call - Add prefix-collision regression tests proving teams:/team_roles: messages don't fall through to the team: handler - Add edge-case tests for empty identifiers, colon-containing identifiers, and unusual email formats Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: enforce channel-origin guard on auth invalidation messages Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are now only accepted when they arrive on the mcpgw:auth:invalidate channel. Messages with auth prefixes on other channels are logged and ignored, preventing cross-channel confusion. Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct method name (_load_revoked_jtis in auth_cache). Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: close channel-guard bypass for empty channel in auth invalidation The previous guard `if channel and channel != "mcpgw:auth:invalidate"` short-circuited to False when channel was empty, allowing auth messages to be processed without channel verification. Changed to `if channel != "mcpgw:auth:invalidate"` so auth messages are only accepted from the correct channel regardless of how the method is called. Signed-off-by: Jonathan Springer <jps@s390x.com> * test: add listener-path regression tests for channel forwarding Add 3 tests that exercise the full _listen_loop -> _process_invalidation path with channel data from fake pubsub messages: - test_listen_loop_forwards_channel_to_process_invalidation: verifies channel kwarg is forwarded from pubsub message to _process_invalidation - test_listen_loop_auth_message_evicts_cache_end_to_end: full path from pubsub message through auth cache eviction - test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end: deny-path proving channel guard works through the listener Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the reconciliation path (Redis L2 fallback on L1 miss, not the unused sync_revoked_tokens method). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor: clarify auth dispatch comment and test names The if/elif chain comment said "Order matters: team_roles/teams must be checked before team" but there is no actual startswith() prefix collision between these prefixes (teams: != team:, team_roles: != team:). Reword the comment to describe correct dispatch and rename the two tests from "prefix_collision" to "dispatches_to_correct_handler" so they describe what they actually verify. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Marek Dano <Marek.Dano@ibm.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Marek Dano <Marek.Dano@ibm.com> Co-authored-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: cross-replica cache invalidation for auth caches Signed-off-by: Marek Dano <Marek.Dano@ibm.com> * refactor: extract auth invalidation into helper and fix formatting - Extract _process_auth_invalidation() from _process_invalidation() to reduce method complexity (pylint too-many-statements/locals) - Add _evict_keys() static helper to deduplicate key-eviction loops - Remove dead _channel backward-compat attribute - Run Black on test file to fix pre-existing formatting violations Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: add missing docstring params in _evict_keys Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: harden cross-replica revoke handler and improve test coverage - Cap _revoked_jtis at 100k entries to prevent unbounded memory growth from compromised Redis pubsub channels (DoS mitigation) - Fix FakePubSub.subscribe signatures in two exception-path tests to accept varargs, matching the production *self._channels call - Add prefix-collision regression tests proving teams:/team_roles: messages don't fall through to the team: handler - Add edge-case tests for empty identifiers, colon-containing identifiers, and unusual email formats Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: enforce channel-origin guard on auth invalidation messages Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are now only accepted when they arrive on the mcpgw:auth:invalidate channel. Messages with auth prefixes on other channels are logged and ignored, preventing cross-channel confusion. Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct method name (_load_revoked_jtis in auth_cache). Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: close channel-guard bypass for empty channel in auth invalidation The previous guard `if channel and channel != "mcpgw:auth:invalidate"` short-circuited to False when channel was empty, allowing auth messages to be processed without channel verification. Changed to `if channel != "mcpgw:auth:invalidate"` so auth messages are only accepted from the correct channel regardless of how the method is called. Signed-off-by: Jonathan Springer <jps@s390x.com> * test: add listener-path regression tests for channel forwarding Add 3 tests that exercise the full _listen_loop -> _process_invalidation path with channel data from fake pubsub messages: - test_listen_loop_forwards_channel_to_process_invalidation: verifies channel kwarg is forwarded from pubsub message to _process_invalidation - test_listen_loop_auth_message_evicts_cache_end_to_end: full path from pubsub message through auth cache eviction - test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end: deny-path proving channel guard works through the listener Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the reconciliation path (Redis L2 fallback on L1 miss, not the unused sync_revoked_tokens method). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor: clarify auth dispatch comment and test names The if/elif chain comment said "Order matters: team_roles/teams must be checked before team" but there is no actual startswith() prefix collision between these prefixes (teams: != team:, team_roles: != team:). Reword the comment to describe correct dispatch and rename the two tests from "prefix_collision" to "dispatches_to_correct_handler" so they describe what they actually verify. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Marek Dano <Marek.Dano@ibm.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Marek Dano <Marek.Dano@ibm.com> Co-authored-by: Jonathan Springer <jps@s390x.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: Yosief Eyob <yosiefogbazion@gmail.com>
🐛 Bug-fix PR
📌 Summary
AuthCache published invalidation messages to
mcpgw:auth:invalidate, butCacheInvalidationSubscriberonly listened onmcpgw:cache:invalidate. In multi-replica deployments, when one replica invalidated auth data (e.g., after a role change or token revocation), other replicas never received the message and continued serving stale in-memory auth caches.💡 Fix Description
_channelslist containing bothmcpgw:cache:invalidateandmcpgw:auth:invalidate. Updatedstart()to subscribe to all channels via awaitself._pubsub.subscribe(*self._channels)andstop()to unsubscribe from all._process_invalidation():user:{email}— clears user cache, context cache, and team cache entries for the userrevoke:{jti}— adds JTI to revoked set, clears revocation cache and matching context cache entriesteam:{email}— clears team cache and context cache for the userrole:{email}:{team_id}— clears specific role cache entryteam_roles:{team_id}— clears all role cache entries for a teamteams:{email}— clears teams list cache entries for the usermembership:{email}— clears team membership cache entries for the userThe startswith check ordering ensures team_roles: and teams: are matched before team:.
Tests:
tests/unit/mcpgateway/cache/test_cache_invalidation_subscriber.pytest_subscriber_initializationto assert_channelsincludes both channelsFakePubSub.subscribe/unsubscribesignatures to accept*_channels(varargs)create_mock_auth_cache()helperTestAuthCacheInvalidationSubscriberclass with 11 tests covering all auth invalidation message types, channel subscription verification, and an end-to-end regression test🧪 Verification
make lintmake testmake coverage📐 MCP Compliance (if relevant)
✅ Checklist
make black isort pre-commit)