Skip to content

fix: use short-lived DB sessions in RBAC permission checks#2319

Closed
crivetimihai wants to merge 1 commit intomainfrom
fix/rbac-session-pool-exhaustion
Closed

fix: use short-lived DB sessions in RBAC permission checks#2319
crivetimihai wants to merge 1 commit intomainfrom
fix/rbac-session-pool-exhaustion

Conversation

@crivetimihai
Copy link
Copy Markdown
Member

Summary

  • Remove db parameter from get_current_user_with_permissions() to prevent holding DB sessions for entire request lifecycle
  • Update permission decorators to use fresh_db_session() for short-lived permission checks
  • Sessions are now released immediately after permission check, before slow downstream operations

Problem

Under high load (4000 concurrent users), the RBAC middleware was holding database sessions for entire request lifetimes, causing:

  • 433 "idle in transaction" connections (vs normal 40-50)
  • Max transaction age of 243 seconds
  • QueuePool exhaustion errors
  • Gateway availability dropping to 33%

Changes

File Change
mcpgateway/middleware/rbac.py Remove db from function signature/return; use fresh_db_session() in decorators
mcpgateway/admin.py Use endpoint's db parameter instead of user["db"]
Test fixtures Remove "db" from mock user contexts; add fresh_db_session mocks

Test plan

  • Unit tests pass (4863 passed)
  • E2E tests pass (187 passed)
  • Docker stack healthy
  • Load test with 4000 users shows stable idle-in-transaction count

Closes #2318

@crivetimihai
Copy link
Copy Markdown
Member Author

Rebased and Extended Fix

This PR has been rebased onto main and extended to fix a critical issue that would have caused runtime errors.

Original Issue

The original PR removed db from the get_current_user_with_permissions() return dict but 33 routes across 4 files were still accessing current_user_ctx["db"], which would cause KeyError at runtime.

Changes Made

1. Core RBAC Changes (mcpgateway/middleware/rbac.py)

  • Permission decorators (@require_permission, @require_admin_permission, @require_any_permission) now use fresh_db_session() for short-lived DB operations
  • get_current_user_with_permissions() no longer accepts or returns a db parameter
  • Proxy authentication code updated to use fresh_db_session() for user lookup
  • User context dict no longer contains "db" key

2. Router Fixes (33 routes across 4 files)

mcpgateway/routers/llm_config_router.py (13 routes):

  • create_provider, list_providers, get_provider, update_provider, delete_provider
  • set_provider_state, check_provider_health
  • create_model, list_models, get_model, update_model, delete_model, set_model_state

mcpgateway/routers/llm_admin_router.py (11 routes):

  • get_providers_partial, get_models_partial
  • set_provider_state_html, check_provider_health, delete_provider_html
  • set_model_state_html, delete_model_html
  • get_api_info_partial, admin_test_api
  • fetch_provider_models, sync_provider_models

mcpgateway/routers/email_auth.py (6 routes):

  • list_users, list_all_auth_events, create_user
  • get_user, update_user, delete_user

mcpgateway/routers/teams.py (3 routes):

  • create_team, list_teams, discover_public_teams

3. Pattern Applied

For each affected route:

# BEFORE (would KeyError):
async def some_route(
    current_user_ctx: dict = Depends(get_current_user_with_permissions),
):
    db = current_user_ctx["db"]  # KeyError!
    ...

# AFTER:
async def some_route(
    db: Session = Depends(get_db),
    current_user_ctx: dict = Depends(get_current_user_with_permissions),
):
    # db is now injected directly by FastAPI
    ...

4. Other Updates

  • mcpgateway/admin.py: Simplified db access
  • tests/unit/mcpgateway/utils/test_proxy_auth.py: Updated tests to not pass db parameter and added mock_fresh_db_session fixture

Why This Matters

The original fix addressed connection pool exhaustion by using short-lived sessions for permission checks. However, it would have broken all the admin routes (LLM config, user management, team management) that still expected db in the user context.

Testing

  • All 4966 unit tests pass
  • All 23 proxy auth tests pass
  • All 409 LLM/email/team related tests pass
  • Modules import successfully

@crivetimihai crivetimihai force-pushed the fix/rbac-session-pool-exhaustion branch from 47397a6 to 0ba41b5 Compare January 25, 2026 20:19
@crivetimihai
Copy link
Copy Markdown
Member Author

crivetimihai commented Jan 25, 2026

Fix Added

Issue: User Context Detection Could Misidentify Request Payloads

Location: mcpgateway/middleware/rbac.py lines 366-377, 512-524, 595-609

Problem: The RBAC decorators (@require_permission, @require_admin_permission, @require_any_permission) detect user context by searching for any dict with an email key. This could pick up request payloads if an endpoint accepts a plain dict body containing email.

@crivetimihai
Copy link
Copy Markdown
Member Author

Design Note: fresh_db_session() and DB Overrides

The RBAC permission decorators now use fresh_db_session() (which uses the global SessionLocal) for short-lived permission checks. This is intentional to fix the session pool exhaustion issue.

Trade-off acknowledged: If tests or future code override get_db to use a different database, fresh_db_session() will still use the default SessionLocal.

Why this is acceptable:

  1. This codebase uses a single database - no per-request DB routing
  2. Tests that need isolation already mock PermissionService.check_permission to return True, bypassing DB access entirely
  3. The short-lived session pattern is the core fix for "idle in transaction" connection accumulation

If multi-tenant DB routing is ever needed: A follow-up change could pass a session factory via user context, but this is not currently required.

@crivetimihai crivetimihai self-assigned this Jan 25, 2026
@crivetimihai crivetimihai added the do-not-merge Do NOT merge. This requires manual effort, with multiple checks! label Jan 25, 2026
Remove db parameter from get_current_user_with_permissions() and update
permission decorators to use fresh_db_session() for short-lived permission
checks. This prevents database connections from being held during slow
downstream operations, avoiding connection pool exhaustion under high load.

Changes:
- Remove db: Session = Depends(get_db) from get_current_user_with_permissions()
- Remove "db" key from user context return value
- Update require_permission, require_admin, require_any_permission decorators
  to use fresh_db_session() for permission checks
- Update PermissionChecker class to use fresh_db_session() in each method
- Update test fixtures to remove "db" from mock user contexts
- Add fresh_db_session mocks to unit tests

Closes #2318

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Copy link
Copy Markdown
Member Author

⚠️ Work In Progress - Do Not Merge

This PR is under active development and requires extensive testing before merge.

Changes Summary

  • RBAC permission decorators now use fresh_db_session() for short-lived permission checks
  • Removed db from user context - routes now inject db directly via Depends(get_db)
  • User context detection uses explicit parameter names instead of scanning all kwargs
  • Fixed 37 routes across 4 router files (email_auth, llm_admin_router, llm_config_router, teams)

Testing Status

  • ✅ 1141 doctests pass
  • ✅ Unit tests pass
  • ✅ E2E tests pass (110 + 17)
  • ⏳ Needs load testing to verify connection pool behavior
  • ⏳ Needs integration testing in staging environment

Known Trade-offs

  • fresh_db_session() uses global SessionLocal, not request-scoped overrides (documented in earlier comment)
  • E2E tests bypass RBAC decorators entirely (by design)

@crivetimihai crivetimihai force-pushed the fix/rbac-session-pool-exhaustion branch from 0ba41b5 to 63fba73 Compare January 25, 2026 21:45
@crivetimihai crivetimihai marked this pull request as draft January 25, 2026 21:45
@crivetimihai
Copy link
Copy Markdown
Member Author

Closing — this experimental approach needs rethinking. Will revisit if DB session issues resurface.

bradmcnew pushed a commit to bradmcnew/mcp-context-forge that referenced this pull request Feb 18, 2026
…pool exhaustion

Replace next(get_db()) with fresh_db_session() context manager in all
three database access points within TokenScopingMiddleware. Extract
query logic into dedicated helper methods (_query_team_membership and
_query_resource_ownership) for cleaner session lifecycle management.

Previously each request held 2+ concurrent connections (middleware +
endpoint handler), causing QueuePool exhaustion under load. With this
fix the middleware connection is returned to the pool before the
endpoint handler runs.

Applies the same pattern established in PR IBM#2326 and PR IBM#2319.

Closes IBM#2330

Signed-off-by: Brad McNew <brad@axite.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge Do NOT merge. This requires manual effort, with multiple checks!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG][PERFORMANCE]: RBAC middleware holds DB sessions for entire request lifecycle causing pool exhaustion

1 participant