Skip to content

fix(rate-limiter): shared state, eviction, thread safety, config validation#3750

Merged
crivetimihai merged 5 commits intomainfrom
test/rate-limiter-plugin
Mar 21, 2026
Merged

fix(rate-limiter): shared state, eviction, thread safety, config validation#3750
crivetimihai merged 5 commits intomainfrom
test/rate-limiter-plugin

Conversation

@gandhipratik203
Copy link
Copy Markdown
Collaborator

@gandhipratik203 gandhipratik203 commented Mar 19, 2026

Summary

Addresses gaps documented in issue #3740 (sub-task of #3735). The rate limiter plugin had several correctness issues that made it silently ineffective in multi-instance deployments.

Gaps closed

Gap 1 (HIGH) — No shared state across instances: each gateway process had its own `_store` dict, making the effective per-user limit `N × configured_limit` where N = number of instances. Fixed with `RedisBackend` using an atomic Lua INCR+EXPIRE script — all instances share the same counter.

Gap 2 (MEDIUM) — No store eviction: the in-memory `_store` dict grew unboundedly as unique users accumulated. Fixed with a background asyncio sweep task in `MemoryBackend` that evicts expired windows every 0.5s.

Gap 3 (MEDIUM) — No thread safety: dict access was not atomic under concurrent asyncio tasks. Fixed with `asyncio.Lock` in `MemoryBackend.allow()`.

Gap 5 (LOW) — `prompt_pre_fetch` silently ignored `by_tool` config. Fixed by adding the same `by_tool` lookup to the prompt hook.

Gap 6 (MEDIUM) — Malformed config (`"abc/m"`) raised `ValueError` at request time. Fixed with `_validate_config()` called at `init` — fails fast at startup.

Gap 7 (LOW) — Unsupported rate unit (`"60/d"`) raised `ValueError` at request time. Fixed by `_parse_rate()` raising early during `_validate_config()`.

Gap 8 (MEDIUM) — Any unexpected exception in the plugin propagated to the caller. Fixed with `try/except Exception` wrapping both hooks; logs and returns permissive result on unexpected error.

Gap 4 (sliding window) intentionally deferred — the fixed window burst-at-boundary is a known trade-off. The `test_fixed_window_burst_at_boundary` xfail test documents the gap.

Additional hardening

PII fix — violation descriptions previously included `user` and `tenant` identifiers in the human-readable string (e.g. `"Rate limit exceeded for tool search, user alice@example.com, or tenant acme-corp"`). This string is logged directly in permissive mode and embedded in `PluginViolationError` messages in enforce mode. Fixed to only include the tool/prompt name — structured metadata remains in the `details` dict.

Bypass resistance documented — three bypass vectors identified and tested:

  • Whitespace-only user identity creates its own bucket (xfail, documented)
  • Tool name case sensitivity bypasses `by_tool` limits (xfail, documented)
  • Tool name whitespace bypasses `by_tool` limits (xfail, documented)

README rewritten — the previous README described the old pre-fix implementation (in-memory only, Redis listed as a TODO). Now documents both backends, full configuration reference, response headers, multi-instance deployment guidance, and an accurate limitations table.

Architecture

  • `MemoryBackend`: `asyncio.Lock` + background TTL sweep; single-process only
  • `RedisBackend`: atomic Lua INCR+EXPIRE; shared across all gateway instances; falls back to `MemoryBackend` on Redis unavailability if `redis_fallback: true`
  • `_validate_config()`: parses all rate strings at `init`; collects all errors and raises `ValueError` with a full report
  • Both hooks: `try/except Exception` with permissive degradation on unexpected errors

Test results

Unit tests: 70 passed, 4 xfailed

  • Core enforcement, headers, multi-dimensional limiting
  • Permissive vs enforce mode (via `PluginExecutor`)
  • Redis fallback (broken Redis → memory, limit still enforced, no-fallback graceful degradation)
  • Cross-tenant isolation and counter bleed prevention
  • Header accuracy (Retry-After bounds, Reset consistency, Remaining decrement)
  • Bypass resistance (None/whitespace identity, tool name normalisation gaps)
  • PII: violation description contains no user or tenant identifiers

Integration tests: 16/16 passed (`tests/integration/test_rate_limiter.py`, `--with-integration`)

Plugin framework tests: 53/53 passed

Performance profiling (`tests/performance/test_plugins_performance.py`, 1000 iterations):

Hook Memory backend Redis backend Delta
`tool_pre_invoke` 0.047ms 0.124ms +0.077ms
`prompt_pre_fetch` 0.044ms 0.117ms +0.073ms

Redis adds ~0.075ms per hook call — well under 1ms.

Multi-instance load test (`make benchmark-rate-limiter`, 3 gateway instances, 1 user at 60 req/min = 2× the 30/m limit, 120s):

Backend Total requests Allowed Blocked Verdict
Memory (before) ~120 ~120 ~0 Limit not enforced
Redis (after) 120 38 82 (40%*) Limit correctly enforced

*40% observed vs ~50% theoretical: the first 30 requests in each 60-second window are always allowed before blocking starts. Over a longer run it converges toward ~50%.

Redis MONITOR confirmed all 3 instances write to the same atomic counter:
```
[172.18.0.8] EVAL "rl:user:admin@example.com:60" "60"
[172.18.0.9] EVAL "rl:user:admin@example.com:60" "60"
[172.18.0.10] EVAL "rl:user:admin@example.com:60" "60"
```

Closes #3740

@crivetimihai crivetimihai added bug Something isn't working SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release plugins performance Performance related items security Improves security labels Mar 20, 2026
@crivetimihai crivetimihai added this to the Release 1.0.0 milestone Mar 20, 2026
@crivetimihai
Copy link
Copy Markdown
Member

Thanks @gandhipratik203. This is a thorough and well-documented fix — correctly addresses 7 of 8 gaps from #3740 with the Redis backend, eviction, thread safety, config validation, and PII fix. Test coverage is strong (70 unit + 16 integration + multi-instance load test). Two notes: (1) the try/except Exception permissive degradation is the right call for a rate limiter, (2) the deferred sliding window gap is well-documented with xfail tests. Solid work.

@gandhipratik203 gandhipratik203 added the wxo wxo integration label Mar 20, 2026
@gandhipratik203 gandhipratik203 self-assigned this Mar 20, 2026
@crivetimihai crivetimihai added MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe and removed SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release labels Mar 20, 2026
@crivetimihai crivetimihai added the release-fix Critical bugfix required for the release label Mar 21, 2026
gandhipratik203 and others added 5 commits March 21, 2026 13:03
Closes #3740

## What changed

### Plugin fixes (plugins/rate_limiter/rate_limiter.py)
- Config validation at __init__: _validate_config() parses all rate
  strings at startup — bad config raises immediately, not mid-request
- Graceful degradation: both hooks wrapped in try/except; unexpected
  errors are logged and the request is allowed through (permissive)
- prompt_pre_fetch now enforces by_tool limits using prompt_id as key
- MemoryBackend: asyncio.Lock makes counter increments atomic
- MemoryBackend: background TTL sweep evicts expired windows (0.5s interval)
- RedisBackend: atomic INCR+EXPIRE via Lua script; shared state across
  all gateway instances; native TTL expiry; falls back to memory on error

### Test additions (tests/unit/.../test_rate_limiter.py)
- Gap tests: 4 xfail -> pass (shared state, eviction, prompt by_tool,
  graceful degradation); 1 xfail remains (fixed window burst, deferred)
- Edge case tests: malformed/unsupported config raises at init (not
  request time); runtime errors degrade gracefully via mock injection
- Redis backend test uses injected FakeRedis — no live server required

### Config changes
- plugins/config.yaml: RateLimiterPlugin enabled with enforce mode
- tests/performance/plugins/config.yaml: RateLimiterPlugin set to
  permissive for inclusion in cProfile benchmark runs

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…config.yaml

Switch the default stack config from in-memory to Redis-backed rate limiting.
This ensures the 30/m per-user limit is enforced as a true shared limit across
all gateway instances rather than 30/m per process.

Validated via Redis MONITOR: all 3 gateway instances atomically increment the
same rl:user:<id>:60 counter via the Lua INCR+EXPIRE script.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…stance correctness

Adds locustfile_rate_limiter.py and a make benchmark-rate-limiter target to
demonstrate the multi-instance rate limit enforcement gap and its fix.

The test sends 1 req/s (60 req/min = 2x the 30/m limit) through 3 gateway
instances. With a memory backend each instance only sees ~20 req/min and
never fires the limiter (~0% failures). With the Redis backend the shared
counter reaches 30/min and blocks ~50% of requests — clearly showing the fix
works across instances.

Expected results:
  Memory backend:  ~0% blocked  (each instance sees 20 req/min < 30/m limit)
  Redis backend:  ~50% blocked  (shared counter: 60 req/min > 30/m limit)

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…and updated docs

- Add 22 new unit tests (70 passed total, 4 xfailed):
  - Permissive vs enforce mode through PluginExecutor
  - Redis fallback: memory takeover when Redis is down, limit still enforced, no-fallback graceful degradation
  - Cross-tenant isolation: independent counters, no counter bleed between tenants
  - Header accuracy: Retry-After bounds, X-RateLimit-Reset future/consistency, Remaining decrement
  - Bypass resistance: None/whitespace user identity, tool name case sensitivity and whitespace (documented as xfail gaps)
  - PII: violation description must not contain user or tenant identifiers

- Fix PII leak in violation description: remove user/tenant from description string
  in both prompt_pre_fetch and tool_pre_invoke — identifiers appeared in log output
  via permissive-mode manager warning and enforce-mode PluginViolationError message

- Rewrite plugins/rate_limiter/README.md: was describing the old pre-fix
  implementation (in-memory only, no Redis, Redis as TODO). Now documents
  both backends, full config reference, response headers, examples, and
  accurate limitations table

- Update plugin-manifest.yaml description to reflect Redis backend support

Closes #3740

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
… validation

- Remove unused _allow() module-level function (dead code — plugin uses
  self._rate_backend.allow() directly)
- Fix test_graceful_degradation test: was patching _allow() which is never
  called by the plugin; now patches backend.allow() via patch.object so the
  try/except error path is actually exercised
- Add prompt_pre_fetch graceful degradation test (was only tested for
  tool_pre_invoke)
- Fix inconsistent by_tool lookup in tool_pre_invoke: remove unnecessary
  hasattr(__contains__) guard, align with prompt_pre_fetch pattern
- Add backend validation to _validate_config(): typo like 'reddis' now
  raises ValueError at startup instead of silently falling back to memory
- Add test for malformed by_tool rate string validation
- Add test for invalid backend name validation
- Change default config mode from enforce to permissive for safety
  (consistent with all other security plugins in the default config)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the test/rate-limiter-plugin branch from 10e120c to 222ba22 Compare March 21, 2026 13:12
Copy link
Copy Markdown
Member

@crivetimihai crivetimihai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Approved with fixes applied

Solid PR addressing real gaps in the rate limiter plugin. The Redis backend with atomic Lua INCR+EXPIRE, memory eviction sweep, asyncio.Lock, and config validation are all well-designed and consistent with the codebase's plugin patterns.

Fixes applied in review commit

Bug: Dead code removed_allow() module-level function was never called by the plugin (which uses self._rate_backend.allow()). Removed.

Bug: False-positive test fixedtest_graceful_degradation_unexpected_runtime_error_does_not_crash_caller patched _allow() (never called), so the try/except error path was never exercised. The test passed vacuously. Fixed to use patch.object(plugin._rate_backend, "allow", ...) which actually triggers the error handling.

Bug: Inconsistent by_tool lookuptool_pre_invoke had an unnecessary hasattr(by_tool_config, "__contains__") guard; prompt_pre_fetch didn't. Since by_tool is always Optional[Dict] from Pydantic, the hasattr check is redundant. Aligned both hooks.

Enhancement: Backend validation — A typo like backend: "reddis" silently fell back to memory with no error. Added validation in _validate_config() to fail fast at startup.

Config: Mode changed to permissive — All other security/safety plugins in the default plugins/config.yaml ship as permissive. Switching rate limiter to enforce would be inconsistent and could break existing deployments on update. Changed to permissive for safety — operators can opt into enforce.

Tests added — Added 3 new tests: prompt_pre_fetch graceful degradation, invalid backend validation, malformed by_tool rate validation. Coverage at 97% (remaining 3% is defensive code for no-event-loop and real Redis client init — covered by integration tests).

Verified

  • 73 unit tests pass, 4 xfail (documented gaps)
  • 1508 plugin tests pass with no regressions
  • Architecture consistent with existing plugin patterns (Config model, init, error handling)
  • No security concerns: PII removed from violation descriptions, fail-open on errors, no sensitive data in headers
  • Redis Lua script is standard atomic INCR+EXPIRE pattern
  • xfail tests properly document known gaps (sliding window, whitespace bypass, case sensitivity)

@crivetimihai crivetimihai merged commit e1b6c68 into main Mar 21, 2026
36 checks passed
@crivetimihai crivetimihai deleted the test/rate-limiter-plugin branch March 21, 2026 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe performance Performance related items plugins release-fix Critical bugfix required for the release security Improves security wxo wxo integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TESTING][PLUGINS]: Test, harden, and document the Rate Limiter plugin

2 participants