fix(rate-limiter): shared state, eviction, thread safety, config validation (#3750)

crivetimihai merged 5 commits into `main`
Conversation
Thanks @gandhipratik203. This is a thorough and well-documented fix — correctly addresses 7 of 8 gaps from #3740 with the Redis backend, eviction, thread safety, config validation, and PII fix. Test coverage is strong (70 unit + 16 integration + multi-instance load test). Two notes: (1) the …
Closes #3740

## What changed

### Plugin fixes (`plugins/rate_limiter/rate_limiter.py`)

- Config validation at `__init__`: `_validate_config()` parses all rate strings at startup — bad config raises immediately, not mid-request
- Graceful degradation: both hooks wrapped in try/except; unexpected errors are logged and the request is allowed through (permissive)
- `prompt_pre_fetch` now enforces `by_tool` limits using `prompt_id` as key
- `MemoryBackend`: `asyncio.Lock` makes counter increments atomic
- `MemoryBackend`: background TTL sweep evicts expired windows (0.5s interval)
- `RedisBackend`: atomic INCR+EXPIRE via Lua script; shared state across all gateway instances; native TTL expiry; falls back to memory on error

### Test additions (`tests/unit/.../test_rate_limiter.py`)

- Gap tests: 4 xfail -> pass (shared state, eviction, prompt `by_tool`, graceful degradation); 1 xfail remains (fixed window burst, deferred)
- Edge case tests: malformed/unsupported config raises at init (not request time); runtime errors degrade gracefully via mock injection
- Redis backend test uses injected FakeRedis — no live server required

### Config changes

- `plugins/config.yaml`: RateLimiterPlugin enabled with enforce mode
- `tests/performance/plugins/config.yaml`: RateLimiterPlugin set to permissive for inclusion in cProfile benchmark runs

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…config.yaml

Switch the default stack config from in-memory to Redis-backed rate limiting. This ensures the 30/m per-user limit is enforced as a true shared limit across all gateway instances rather than 30/m per process.

Validated via Redis MONITOR: all 3 gateway instances atomically increment the same `rl:user:<id>:60` counter via the Lua INCR+EXPIRE script.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
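For illustration, a Redis-backed plugin entry in the stack config might look roughly like this (key names and module path are assumptions for the sketch, not the exact schema):

```yaml
plugins:
  - name: RateLimiterPlugin
    mode: enforce            # or "permissive" to log without blocking
    config:
      backend: redis         # "memory" (per-process) or "redis" (shared)
      by_user: "30/m"        # enforced as one shared counter across instances
```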
…stance correctness

Adds `locustfile_rate_limiter.py` and a `make benchmark-rate-limiter` target to demonstrate the multi-instance rate limit enforcement gap and its fix. The test sends 1 req/s (60 req/min = 2x the 30/m limit) through 3 gateway instances. With a memory backend each instance only sees ~20 req/min and never fires the limiter (~0% failures). With the Redis backend the shared counter reaches 30/min and blocks ~50% of requests — clearly showing the fix works across instances.

Expected results:

- Memory backend: ~0% blocked (each instance sees 20 req/min < 30/m limit)
- Redis backend: ~50% blocked (shared counter: 60 req/min > 30/m limit)

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…and updated docs

- Add 22 new unit tests (70 passed total, 4 xfailed):
  - Permissive vs enforce mode through PluginExecutor
  - Redis fallback: memory takeover when Redis is down, limit still enforced, no-fallback graceful degradation
  - Cross-tenant isolation: independent counters, no counter bleed between tenants
  - Header accuracy: Retry-After bounds, X-RateLimit-Reset future/consistency, Remaining decrement
  - Bypass resistance: None/whitespace user identity, tool name case sensitivity and whitespace (documented as xfail gaps)
  - PII: violation description must not contain user or tenant identifiers
- Fix PII leak in violation description: remove user/tenant from description string in both `prompt_pre_fetch` and `tool_pre_invoke` — identifiers appeared in log output via permissive-mode manager warning and enforce-mode `PluginViolationError` message
- Rewrite `plugins/rate_limiter/README.md`: was describing the old pre-fix implementation (in-memory only, no Redis, Redis as TODO). Now documents both backends, full config reference, response headers, examples, and accurate limitations table
- Update `plugin-manifest.yaml` description to reflect Redis backend support

Closes #3740

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
… validation

- Remove unused `_allow()` module-level function (dead code — plugin uses `self._rate_backend.allow()` directly)
- Fix `test_graceful_degradation` test: was patching `_allow()` which is never called by the plugin; now patches `backend.allow()` via `patch.object` so the try/except error path is actually exercised
- Add `prompt_pre_fetch` graceful degradation test (was only tested for `tool_pre_invoke`)
- Fix inconsistent `by_tool` lookup in `tool_pre_invoke`: remove unnecessary `hasattr(__contains__)` guard, align with `prompt_pre_fetch` pattern
- Add backend validation to `_validate_config()`: a typo like `'reddis'` now raises `ValueError` at startup instead of silently falling back to memory
- Add test for malformed `by_tool` rate string validation
- Add test for invalid backend name validation
- Change default config mode from enforce to permissive for safety (consistent with all other security plugins in the default config)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
crivetimihai left a comment
Review: Approved with fixes applied
Solid PR addressing real gaps in the rate limiter plugin. The Redis backend with atomic Lua INCR+EXPIRE, memory eviction sweep, asyncio.Lock, and config validation are all well-designed and consistent with the codebase's plugin patterns.
Fixes applied in review commit
Bug: Dead code removed — _allow() module-level function was never called by the plugin (which uses self._rate_backend.allow()). Removed.
Bug: False-positive test fixed — test_graceful_degradation_unexpected_runtime_error_does_not_crash_caller patched _allow() (never called), so the try/except error path was never exercised. The test passed vacuously. Fixed to use patch.object(plugin._rate_backend, "allow", ...) which actually triggers the error handling.
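The fix can be illustrated with a toy model (class and method names here are illustrative stand-ins, not the plugin's real code): patching the method the plugin actually calls is what exercises the except path.

```python
import asyncio
from unittest.mock import patch


class FakeBackend:
    async def allow(self, *args, **kwargs):
        return True


class ToyPlugin:
    """Toy stand-in for the plugin, showing why the patch target matters."""

    def __init__(self):
        self._rate_backend = FakeBackend()

    async def tool_pre_invoke(self):
        try:
            await self._rate_backend.allow()
            return "allowed"
        except Exception:
            return "degraded"  # fail-open path under test


plugin = ToyPlugin()
# Patching the attribute the plugin actually calls triggers the except branch;
# patching an unused module-level helper would leave this path untested.
with patch.object(plugin._rate_backend, "allow", side_effect=RuntimeError("boom")):
    result = asyncio.run(plugin.tool_pre_invoke())
print(result)  # -> degraded
```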
Bug: Inconsistent by_tool lookup — tool_pre_invoke had an unnecessary hasattr(by_tool_config, "__contains__") guard; prompt_pre_fetch didn't. Since by_tool is always Optional[Dict] from Pydantic, the hasattr check is redundant. Aligned both hooks.
Enhancement: Backend validation — A typo like backend: "reddis" silently fell back to memory with no error. Added validation in _validate_config() to fail fast at startup.
Config: Mode changed to permissive — All other security/safety plugins in the default plugins/config.yaml ship as permissive. Switching rate limiter to enforce would be inconsistent and could break existing deployments on update. Changed to permissive for safety — operators can opt into enforce.
Tests added — Added 3 new tests: prompt_pre_fetch graceful degradation, invalid backend validation, malformed by_tool rate validation. Coverage at 97% (remaining 3% is defensive code for no-event-loop and real Redis client init — covered by integration tests).
Verified
- 73 unit tests pass, 4 xfail (documented gaps)
- 1508 plugin tests pass with no regressions
- Architecture consistent with existing plugin patterns (Config model, init, error handling)
- No security concerns: PII removed from violation descriptions, fail-open on errors, no sensitive data in headers
- Redis Lua script is standard atomic INCR+EXPIRE pattern
- xfail tests properly document known gaps (sliding window, whitespace bypass, case sensitivity)
Summary
Addresses gaps documented in issue #3740 (sub-task of #3735). The rate limiter plugin had several correctness issues that made it silently ineffective in multi-instance deployments.
Gaps closed
Gap 1 (HIGH) — No shared state across instances: each gateway process had its own `_store` dict, making the effective per-user limit `N × configured_limit` where N = number of instances. Fixed with `RedisBackend` using an atomic Lua INCR+EXPIRE script — all instances share the same counter.
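The shared-counter pattern can be sketched as follows (a minimal sketch, assuming an async Redis client with `register_script`; class and method names are illustrative, not the plugin's exact code). The Lua script runs atomically inside Redis, so concurrent gateway instances cannot race between the INCR and the EXPIRE:

```python
# Atomic INCR+EXPIRE: EXPIRE is set only on the first hit in a window,
# so the key's TTL is the window length and expiry is handled by Redis.
INCR_EXPIRE_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""


class RedisBackend:
    def __init__(self, redis_client):
        self._redis = redis_client
        self._script = redis_client.register_script(INCR_EXPIRE_LUA)

    async def allow(self, key: str, limit: int, window_seconds: int) -> bool:
        # e.g. key = "rl:user:alice@example.com:60" — same key on every instance
        count = await self._script(keys=[key], args=[window_seconds])
        return count <= limit
```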
Gap 2 (MEDIUM) — No store eviction: the in-memory `_store` dict grew unboundedly as unique users accumulated. Fixed with a background asyncio sweep task in `MemoryBackend` that evicts expired windows every 0.5s.
Gap 3 (MEDIUM) — No thread safety: dict access was not atomic under concurrent asyncio tasks. Fixed with `asyncio.Lock` in `MemoryBackend.allow()`.
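Gaps 2 and 3 together can be sketched like this (field names and structure are assumptions for illustration, not the plugin's exact code): a lock makes the read-modify-write atomic across asyncio tasks, and a background task bounds `_store` growth.

```python
import asyncio
import time


class MemoryBackend:
    """Sketch of an in-memory counter with a lock and a TTL eviction sweep."""

    def __init__(self, sweep_interval: float = 0.5):
        self._store = {}  # key -> (count, window_expiry)
        self._lock = asyncio.Lock()
        self._sweep_interval = sweep_interval
        self._sweep_task = None

    async def allow(self, key: str, limit: int, window_seconds: int) -> bool:
        now = time.monotonic()
        async with self._lock:  # read-modify-write is atomic across tasks
            count, expiry = self._store.get(key, (0, now + window_seconds))
            if now >= expiry:  # window rolled over: reset the counter
                count, expiry = 0, now + window_seconds
            self._store[key] = (count + 1, expiry)
            return count + 1 <= limit

    async def _sweep(self):
        # Background task: evict expired windows so _store cannot grow unboundedly.
        while True:
            await asyncio.sleep(self._sweep_interval)
            now = time.monotonic()
            async with self._lock:
                expired = [k for k, (_, exp) in self._store.items() if now >= exp]
                for k in expired:
                    del self._store[k]

    def start(self):
        self._sweep_task = asyncio.get_running_loop().create_task(self._sweep())
```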
Gap 5 (LOW) — `prompt_pre_fetch` silently ignored `by_tool` config. Fixed by adding the same `by_tool` lookup to the prompt hook.
Gap 6 (MEDIUM) — Malformed config (`"abc/m"`) raised `ValueError` at request time. Fixed with `_validate_config()` called at `init` — fails fast at startup.
Gap 7 (LOW) — Unsupported rate unit (`"60/d"`) raised `ValueError` at request time. Fixed by `_parse_rate()` raising early during `_validate_config()`.
Gap 8 (MEDIUM) — Any unexpected exception in the plugin propagated to the caller. Fixed with `try/except Exception` wrapping both hooks; logs and returns permissive result on unexpected error.
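The fail-open pattern can be sketched as a decorator (decorator form and return shape are illustrative; per the PR the plugin wraps the hook bodies in try/except directly):

```python
import logging

logger = logging.getLogger(__name__)


def fail_open(hook):
    """Sketch of the Gap 8 pattern: unexpected errors in a hook are
    logged and the request is allowed through, never propagated."""

    async def wrapper(*args, **kwargs):
        try:
            return await hook(*args, **kwargs)
        except Exception:
            logger.exception("rate limiter hook failed; allowing request")
            return {"continue_processing": True}  # permissive result (shape assumed)

    return wrapper
```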
Gap 4 (sliding window) intentionally deferred — the fixed window burst-at-boundary is a known trade-off. The `test_fixed_window_burst_at_boundary` xfail test documents the gap.
Additional hardening
PII fix — violation descriptions previously included `user` and `tenant` identifiers in the human-readable string (e.g. `"Rate limit exceeded for tool search, user alice@example.com, or tenant acme-corp"`). This string is logged directly in permissive mode and embedded in `PluginViolationError` messages in enforce mode. Fixed to only include the tool/prompt name — structured metadata remains in the `details` dict.
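The before/after split can be sketched like this (function names are illustrative; the point is which string gets logged):

```python
def violation_description(kind: str, name: str) -> str:
    """Post-fix: only the tool/prompt name appears in the human-readable
    string that reaches logs and PluginViolationError messages."""
    return f"Rate limit exceeded for {kind} {name}"


def violation_details(user: str, tenant: str, limit: str) -> dict:
    # Identifiers live only in structured metadata, not the logged string.
    return {"user": user, "tenant": tenant, "limit": limit}
```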
Bypass resistance documented — three bypass vectors identified and tested as xfail gaps: `None`/whitespace user identity, tool name case sensitivity, and tool name whitespace.
README rewritten — the previous README described the old pre-fix implementation (in-memory only, Redis listed as a TODO). Now documents both backends, full configuration reference, response headers, multi-instance deployment guidance, and an accurate limitations table.
Test results
Unit tests: 70 passed, 4 xfailed
Integration tests: 16/16 passed (`tests/integration/test_rate_limiter.py`, `--with-integration`)
Plugin framework tests: 53/53 passed
Performance profiling (`tests/performance/test_plugins_performance.py`, 1000 iterations):
Redis adds ~0.075ms per hook call — well under 1ms.
Multi-instance load test (`make benchmark-rate-limiter`, 3 gateway instances, 1 user at 60 req/min = 2× the 30/m limit, 120s):
Observed block rate was ~40% with the Redis backend vs ~50% theoretical: the first 30 requests in each 60-second window are always allowed before blocking starts. Over a longer run it converges toward ~50%.
Redis MONITOR confirmed all 3 instances write to the same atomic counter:
```
[172.18.0.8] EVAL "rl:user:admin@example.com:60" "60"
[172.18.0.9] EVAL "rl:user:admin@example.com:60" "60"
[172.18.0.10] EVAL "rl:user:admin@example.com:60" "60"
```
Closes #3740