Skip to content

[FIX][AUTH]: Cross-replica cache invalidation for auth caches#3309

Merged
crivetimihai merged 8 commits intomainfrom
fix-cross-replica-cache-invalidation
Mar 8, 2026
Merged

[FIX][AUTH]: Cross-replica cache invalidation for auth caches#3309
crivetimihai merged 8 commits intomainfrom
fix-cross-replica-cache-invalidation

Conversation

@marekdano
Copy link
Copy Markdown
Collaborator

🐛 Bug-fix PR


📌 Summary

AuthCache published invalidation messages to mcpgw:auth:invalidate, but CacheInvalidationSubscriber only listened on mcpgw:cache:invalidate. In multi-replica deployments, when one replica invalidated auth data (e.g., after a role change or token revocation), other replicas never received the message and continued serving stale in-memory auth caches.

💡 Fix Description

  1. Subscribe to both channels — Added _channels list containing both mcpgw:cache:invalidate and mcpgw:auth:invalidate. Updated start() to subscribe to all channels via await self._pubsub.subscribe(*self._channels) and stop() to unsubscribe from all.
  2. Handle auth invalidation messages — Added 7 new message handlers in _process_invalidation():
    • user:{email} — clears user cache, context cache, and team cache entries for the user
    • revoke:{jti} — adds JTI to revoked set, clears revocation cache and matching context cache entries
    • team:{email} — clears team cache and context cache for the user
    • role:{email}:{team_id} — clears specific role cache entry
    • team_roles:{team_id} — clears all role cache entries for a team
    • teams:{email} — clears teams list cache entries for the user
    • membership:{email} — clears team membership cache entries for the user

The startswith check ordering ensures team_roles: and teams: are matched before team:.

Tests: tests/unit/mcpgateway/cache/test_cache_invalidation_subscriber.py

  • Updated test_subscriber_initialization to assert _channels includes both channels
  • Fixed FakePubSub.subscribe/unsubscribe signatures to accept *_channels (varargs)
  • Added create_mock_auth_cache() helper
  • Added TestAuthCacheInvalidationSubscriber class with 11 tests covering all auth invalidation message types, channel subscription verification, and an end-to-end regression test

🧪 Verification

Check Command Status
Lint suite make lint
Unit tests make test
Coverage ≥ 80 % make coverage
Manual regression no longer fails steps / screenshots

📐 MCP Compliance (if relevant)

  • Matches current MCP spec
  • No breaking change to MCP clients

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • No secrets/credentials committed

@crivetimihai crivetimihai added this to the Release 1.0.0-RC2 milestone Feb 27, 2026
@crivetimihai crivetimihai changed the title fix: cross-replica cache invalidation for auth caches [FIX][AUTH]: Cross-replica cache invalidation for auth caches Feb 27, 2026
@crivetimihai crivetimihai added bug Something isn't working security Improves security MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe labels Feb 27, 2026
@crivetimihai
Copy link
Copy Markdown
Member

Thanks @marekdano. Critical fix for multi-replica deployments — the channel mismatch (mcpgw:auth:invalidate vs mcpgw:cache:invalidate) meant auth cache changes weren't propagated across replicas. The 7 new message handlers cover the key invalidation scenarios. Good test coverage.

jonpspri
jonpspri previously approved these changes Mar 1, 2026
Copy link
Copy Markdown
Collaborator

@jonpspri jonpspri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Approve

Reviewed the full diff (6 commits after rebase onto main), ran all linters and tests.

What was checked

Check Result
Black / isort Clean
flake8 Clean
pylint 10.00/10
bandit 0 issues
Unit tests (49) All pass

Assessment

Core fix is correct. AuthCache publishes to mcpgw:auth:invalidate but the subscriber only listened on mcpgw:cache:invalidate — the channel mismatch meant auth invalidations never propagated across replicas. Subscribing to both channels and adding the 7 auth message handlers resolves the gap.

Security:

  • Channel-origin guard ensures auth-prefixed messages are only accepted from mcpgw:auth:invalidate, preventing spoofing via the cache channel.
  • Final commit closes a subtle bypass where channel="" (empty/falsy) would short-circuit the guard — now rejects unless channel is exactly mcpgw:auth:invalidate.
  • _MAX_REVOKED_JTIS cap (100K) prevents unbounded memory growth from pubsub flooding.

Correctness:

  • All 7 message types published by auth_cache (user:, revoke:, team:, role:, team_roles:, teams:, membership:) have matching handlers.
  • _AUTH_PREFIXES tuple orders team_roles: and teams: before team: to prevent prefix collision in startswith().
  • threading.Lock usage in sync _process_auth_invalidation called from async context is safe — only fast dict ops, no await while held.

Test coverage:

  • 11 new tests covering all 7 handler types, channel guard deny/accept paths, _MAX_REVOKED_JTIS cap, prefix collision regressions, empty identifiers, and identifiers with colons.
  • End-to-end cross-replica regression test included.

@crivetimihai crivetimihai added the merge-queue Rebased and ready to merge label Mar 3, 2026
@jonpspri jonpspri force-pushed the fix-cross-replica-cache-invalidation branch from ecbf5c3 to 6ab1700 Compare March 3, 2026 11:46
@crivetimihai crivetimihai self-assigned this Mar 7, 2026
@crivetimihai crivetimihai added the release-fix Critical bugfix required for the release label Mar 7, 2026
Marek Dano and others added 8 commits March 8, 2026 17:34
Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
- Extract _process_auth_invalidation() from _process_invalidation()
  to reduce method complexity (pylint too-many-statements/locals)
- Add _evict_keys() static helper to deduplicate key-eviction loops
- Remove dead _channel backward-compat attribute
- Run Black on test file to fix pre-existing formatting violations

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
- Cap _revoked_jtis at 100k entries to prevent unbounded memory growth
  from compromised Redis pubsub channels (DoS mitigation)
- Fix FakePubSub.subscribe signatures in two exception-path tests to
  accept varargs, matching the production *self._channels call
- Add prefix-collision regression tests proving teams:/team_roles:
  messages don't fall through to the team: handler
- Add edge-case tests for empty identifiers, colon-containing
  identifiers, and unusual email formats

Signed-off-by: Jonathan Springer <jps@s390x.com>
Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are
now only accepted when they arrive on the mcpgw:auth:invalidate channel.
Messages with auth prefixes on other channels are logged and ignored,
preventing cross-channel confusion.

Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct
method name (_load_revoked_jtis in auth_cache).

Signed-off-by: Jonathan Springer <jps@s390x.com>
The previous guard `if channel and channel != "mcpgw:auth:invalidate"`
short-circuited to False when channel was empty, allowing auth
messages to be processed without channel verification. Changed to
`if channel != "mcpgw:auth:invalidate"` so auth messages are only
accepted from the correct channel regardless of how the method is
called.

Signed-off-by: Jonathan Springer <jps@s390x.com>
Add 3 tests that exercise the full _listen_loop -> _process_invalidation
path with channel data from fake pubsub messages:

- test_listen_loop_forwards_channel_to_process_invalidation: verifies
  channel kwarg is forwarded from pubsub message to _process_invalidation
- test_listen_loop_auth_message_evicts_cache_end_to_end: full path from
  pubsub message through auth cache eviction
- test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end:
  deny-path proving channel guard works through the listener

Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the
reconciliation path (Redis L2 fallback on L1 miss, not the unused
sync_revoked_tokens method).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The if/elif chain comment said "Order matters: team_roles/teams must be
checked before team" but there is no actual startswith() prefix collision
between these prefixes (teams: != team:, team_roles: != team:).  Reword
the comment to describe correct dispatch and rename the two tests from
"prefix_collision" to "dispatches_to_correct_handler" so they describe
what they actually verify.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the fix-cross-replica-cache-invalidation branch from 9a4b78c to 423a78f Compare March 8, 2026 18:49
Copy link
Copy Markdown
Member

@crivetimihai crivetimihai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Approve

Rebased onto current main (clean, no conflicts), reviewed the full implementation, ran linting, tests, and verified on a live 3-replica deployment with Redis.

Verification

Check Result
Black / isort / flake8 / bandit Clean
Unit tests (52, up from 49) All pass
Full cache module tests (440) All pass
Live deployment (3 replicas + Redis + nginx LB) All endpoints healthy
Redis pubsub mcpgw:auth:invalidate subscribers 72 (matches mcpgw:cache:invalidate)
All 7 auth invalidation message types via Redis 72 subscribers received each
Channel-origin guard (auth msgs on wrong channel) Correctly rejected
Admin UI (login, dashboard, tools, tokens) Working

Changes made during review (2 commits)

1. test: add listener-path regression tests for channel forwarding

Added 3 tests that exercise the full _listen_loop_process_invalidation path with channel data from fake pubsub messages, closing a coverage gap identified during review:

  • test_listen_loop_forwards_channel_to_process_invalidation — verifies channel kwarg is forwarded from pubsub message
  • test_listen_loop_auth_message_evicts_cache_end_to_end — full path from pubsub message through auth cache eviction
  • test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end — deny-path proving channel guard works through the listener

Also fixed _MAX_REVOKED_JTIS docstring to accurately describe the reconciliation path (Redis L2 fallback on L1 miss via is_token_revoked() / get_auth_context(), not the unused sync_revoked_tokens() method).

2. refactor: clarify auth dispatch comment and test names

The comment "Order matters: team_roles/teams must be checked before team" was misleading — there is no actual startswith() prefix collision between teams:, team_roles:, and team: (the 5th character differs: s/_ vs :). Reworded the comment and renamed two test methods from prefix_collision to dispatches_to_correct_handler.

Assessment

  • Core fix is correct. All 7 publish calls in auth_cache.py have matching handlers that replicate the exact L1 eviction logic.
  • Security is sound. Channel-origin guard rejects auth messages on wrong channel; empty/missing channel fails closed; _MAX_REVOKED_JTIS caps memory growth.
  • Thread safety matches existing patterns. Short lock scopes with no await under lock.
  • Test coverage is comprehensive. 52 tests including listener-path, channel guard deny/accept, cap enforcement, dispatch correctness, edge cases, and end-to-end regression.

Note for follow-up

invalidate_team() in auth_cache.py:508 does _team_cache.pop(email, None) with a bare email key, but production code writes _team_cache keys as f"{user_email}:{sorted_teams}". This means the pop never matches a real entry. This is pre-existing behavior (not introduced by this PR) and the subscriber correctly mirrors it. Worth a separate fix.

@crivetimihai crivetimihai merged commit f3d4849 into main Mar 8, 2026
40 of 41 checks passed
@crivetimihai crivetimihai deleted the fix-cross-replica-cache-invalidation branch March 8, 2026 19:51
MohanLaksh pushed a commit that referenced this pull request Mar 12, 2026
* fix: cross-replica cache invalidation for auth caches

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* refactor: extract auth invalidation into helper and fix formatting

- Extract _process_auth_invalidation() from _process_invalidation()
  to reduce method complexity (pylint too-many-statements/locals)
- Add _evict_keys() static helper to deduplicate key-eviction loops
- Remove dead _channel backward-compat attribute
- Run Black on test file to fix pre-existing formatting violations

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: add missing docstring params in _evict_keys

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: harden cross-replica revoke handler and improve test coverage

- Cap _revoked_jtis at 100k entries to prevent unbounded memory growth
  from compromised Redis pubsub channels (DoS mitigation)
- Fix FakePubSub.subscribe signatures in two exception-path tests to
  accept varargs, matching the production *self._channels call
- Add prefix-collision regression tests proving teams:/team_roles:
  messages don't fall through to the team: handler
- Add edge-case tests for empty identifiers, colon-containing
  identifiers, and unusual email formats

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: enforce channel-origin guard on auth invalidation messages

Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are
now only accepted when they arrive on the mcpgw:auth:invalidate channel.
Messages with auth prefixes on other channels are logged and ignored,
preventing cross-channel confusion.

Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct
method name (_load_revoked_jtis in auth_cache).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: close channel-guard bypass for empty channel in auth invalidation

The previous guard `if channel and channel != "mcpgw:auth:invalidate"`
short-circuited to False when channel was empty, allowing auth
messages to be processed without channel verification. Changed to
`if channel != "mcpgw:auth:invalidate"` so auth messages are only
accepted from the correct channel regardless of how the method is
called.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: add listener-path regression tests for channel forwarding

Add 3 tests that exercise the full _listen_loop -> _process_invalidation
path with channel data from fake pubsub messages:

- test_listen_loop_forwards_channel_to_process_invalidation: verifies
  channel kwarg is forwarded from pubsub message to _process_invalidation
- test_listen_loop_auth_message_evicts_cache_end_to_end: full path from
  pubsub message through auth cache eviction
- test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end:
  deny-path proving channel guard works through the listener

Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the
reconciliation path (Redis L2 fallback on L1 miss, not the unused
sync_revoked_tokens method).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor: clarify auth dispatch comment and test names

The if/elif chain comment said "Order matters: team_roles/teams must be
checked before team" but there is no actual startswith() prefix collision
between these prefixes (teams: != team:, team_roles: != team:).  Reword
the comment to describe correct dispatch and rename the two tests from
"prefix_collision" to "dispatches_to_correct_handler" so they describe
what they actually verify.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Marek Dano <Marek.Dano@ibm.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Yosiefeyob pushed a commit that referenced this pull request Mar 13, 2026
* fix: cross-replica cache invalidation for auth caches

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>

* refactor: extract auth invalidation into helper and fix formatting

- Extract _process_auth_invalidation() from _process_invalidation()
  to reduce method complexity (pylint too-many-statements/locals)
- Add _evict_keys() static helper to deduplicate key-eviction loops
- Remove dead _channel backward-compat attribute
- Run Black on test file to fix pre-existing formatting violations

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: add missing docstring params in _evict_keys

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: harden cross-replica revoke handler and improve test coverage

- Cap _revoked_jtis at 100k entries to prevent unbounded memory growth
  from compromised Redis pubsub channels (DoS mitigation)
- Fix FakePubSub.subscribe signatures in two exception-path tests to
  accept varargs, matching the production *self._channels call
- Add prefix-collision regression tests proving teams:/team_roles:
  messages don't fall through to the team: handler
- Add edge-case tests for empty identifiers, colon-containing
  identifiers, and unusual email formats

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: enforce channel-origin guard on auth invalidation messages

Auth-prefixed invalidation messages (user:, revoke:, team:, etc.) are
now only accepted when they arrive on the mcpgw:auth:invalidate channel.
Messages with auth prefixes on other channels are logged and ignored,
preventing cross-channel confusion.

Also fixes the _MAX_REVOKED_JTIS docstring to reference the correct
method name (_load_revoked_jtis in auth_cache).

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: close channel-guard bypass for empty channel in auth invalidation

The previous guard `if channel and channel != "mcpgw:auth:invalidate"`
short-circuited to False when channel was empty, allowing auth
messages to be processed without channel verification. Changed to
`if channel != "mcpgw:auth:invalidate"` so auth messages are only
accepted from the correct channel regardless of how the method is
called.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* test: add listener-path regression tests for channel forwarding

Add 3 tests that exercise the full _listen_loop -> _process_invalidation
path with channel data from fake pubsub messages:

- test_listen_loop_forwards_channel_to_process_invalidation: verifies
  channel kwarg is forwarded from pubsub message to _process_invalidation
- test_listen_loop_auth_message_evicts_cache_end_to_end: full path from
  pubsub message through auth cache eviction
- test_listen_loop_auth_message_rejected_on_wrong_channel_end_to_end:
  deny-path proving channel guard works through the listener

Also fixes _MAX_REVOKED_JTIS docstring to accurately describe the
reconciliation path (Redis L2 fallback on L1 miss, not the unused
sync_revoked_tokens method).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor: clarify auth dispatch comment and test names

The if/elif chain comment said "Order matters: team_roles/teams must be
checked before team" but there is no actual startswith() prefix collision
between these prefixes (teams: != team:, team_roles: != team:).  Reword
the comment to describe correct dispatch and rename the two tests from
"prefix_collision" to "dispatches_to_correct_handler" so they describe
what they actually verify.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Marek Dano <Marek.Dano@ibm.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Marek Dano <Marek.Dano@ibm.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Yosief Eyob <yosiefogbazion@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working merge-queue Rebased and ready to merge MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe release-fix Critical bugfix required for the release security Improves security

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants