feat(auth): add LDAP / Active Directory authentication support #2966

Closed
crivetimihai wants to merge 2204 commits into main from ldap

Conversation

@crivetimihai
Member

Summary

  • Implement LDAP simple bind authentication with two-step flow (service account search + user bind) and JWT session token issuance
  • Add auto-provisioning of gateway users from LDAP directory entries with configurable group-to-role mapping
  • Add periodic background directory sync importing users and groups (as teams) with optional orphan removal
  • Provide Docker Compose infrastructure (OpenLDAP + phpLDAPadmin) and Makefile targets for local LDAP development/testing
  • Security hardening: StartTLS before bind (AUTO_BIND_TLS_BEFORE_BIND), account lockout integration, auth provider isolation (prevents LDAP takeover of local accounts), role refresh on every login, admin-only status endpoint with sanitized error messages
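The group-to-role mapping mentioned above can be sketched in plain Python. This is an illustrative sketch only: the function name `map_groups_to_roles`, the role names, the `viewer` default, and the case-insensitive DN comparison are all assumptions, not taken from `ldap_service.py`.

```python
from typing import Dict, Iterable, List


def map_groups_to_roles(
    group_dns: Iterable[str],
    role_mappings: Dict[str, str],
    default_role: str = "viewer",  # hypothetical default, not from the PR
) -> List[str]:
    """Return sorted, de-duplicated roles for a user's LDAP group DNs.

    DNs are compared case-insensitively (an assumption of this sketch).
    default_role is returned when no group matches a mapping.
    """
    normalized = {dn.strip().lower(): role for dn, role in role_mappings.items()}
    roles = set()
    for dn in group_dns:
        key = dn.strip().lower()
        if key in normalized:
            roles.add(normalized[key])
    return sorted(roles) if roles else [default_role]
```

Because the summary states that roles are refreshed on every login, a function like this would be called each time a user binds successfully, so that directory changes take effect at the next login.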

New files

| File | Description |
| --- | --- |
| mcpgateway/services/ldap_service.py | Core LDAP service: authenticate, search, sync, user provisioning |
| mcpgateway/routers/ldap_auth.py | REST API: /auth/ldap/login, /auth/ldap/status, /auth/ldap/sync |
| docker-compose.ldap.yml | Compose overlay enabling LDAP with preconfigured env vars |
| infra/ldap/seed.ldif | Demo data: 5 users + 4 groups |
| tests/unit/mcpgateway/services/test_ldap_service.py | 47 service tests |
| tests/unit/mcpgateway/routers/test_ldap_auth.py | 23 router tests |

Modified files

| File | Change |
| --- | --- |
| mcpgateway/config.py | 29 LDAP settings (URI, bind DN, TLS, sync, role mappings, etc.) |
| mcpgateway/schemas.py | LdapLoginRequest, LdapSyncResponse, LdapStatusResponse |
| mcpgateway/main.py | Router registration + background sync loop in lifespan |
| docker-compose.yml | OpenLDAP + phpLDAPadmin services under ldap profile |
| pyproject.toml | ldap3>=2.9.1 optional dependency ([ldap] extra) |
| .env.example | Documented LDAP configuration section |
| Makefile | compose-ldap, compose-ldap-seed, compose-ldap-down, compose-ldap-clean |

Test plan

  • 70 LDAP unit tests pass (service + router)
  • Full unit test suite (11855 tests) passes with zero regressions
  • Manual: make compose-ldap && make compose-ldap-seed then test login via /auth/ldap/login
  • Manual: Verify POST /auth/ldap/sync imports users/groups as teams
  • Manual: Verify StartTLS configuration with LDAP_START_TLS=true

Closes #284

gcgoncalves and others added 30 commits January 20, 2026 21:52
* Resolve PostgreSQL GROUP BY error in analytics

Resolves a GroupingError that occurred in PostgreSQL when querying
observability data (tools, prompts, resources). The error was caused by
SQLAlchemy generating different column expressions for the SELECT list
and the GROUP BY clause when extracting data from JSON fields.

This fix refactors the queries to create a single expression for the
JSON field extraction and reuses it in both the SELECT and GROUP BY
clauses, ensuring compatibility with PostgreSQL's strict grouping requirements.

Additionally, this commit introduces:
  - Unit tests parametrized for both PostgreSQL and SQLite dialects to
  verify the fix at the query construction level.
  - An end-to-end test that calls the affected API endpoints to ensure they
  function correctly with a database backend.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>

* 2182 - Improve observability charts lifecycle

Introduces a global Chart.js registry to manage chart instances
across the admin UI, preventing "canvas already in use" errors.
Enhances observability panels (metrics, tools, prompts) by:

 - Centralizing chart destruction and registration.
 - Implementing visibility checks to defer rendering of charts on hidden canvases.
 - Adding safeguards to prevent redundant and concurrent loading of partial views.
 - Ensuring proper cleanup of charts and auto-refresh intervals when navigating
  between tabs or unloading the page.

These changes improve the stability and performance of the observability
dashboard, providing a smoother user experience.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>

* Linting and testing fix

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>

* fix(tests): use proper user context in observability SQL tests

The RBAC decorator expects a dict with 'email' and 'db' keys, not a
MagicMock. Updated the tests to use create_mock_user_context() which
provides the correct structure for permission checks.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(observability): address chart lifecycle and cleanup issues

- Remove redundant chart destruction on observability tab entry
  (charts are already destroyed when leaving the tab)
- Replace ineffective 'destroy' DOM event listeners with beforeunload
  handlers (Alpine.js doesn't emit 'destroy' DOM events)
- Move e2e test before __main__ block so it runs with pytest --main
- Update observability_resources.html to use global chart registry
  for consistent chart lifecycle management
- Add resources- prefix to chart cleanup when leaving observability tab

These fixes address blank chart issues when returning to the
observability tab and ensure proper cleanup of intervals/handlers.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(observability): stop auto-refresh and reset state on tab leave

Addresses two issues identified in review:

1. Blank charts on return: When leaving observability tab, charts are
   destroyed but `metricsLoaded` etc. flags remained true, preventing
   partials from reloading. Now dispatch `observability:leave` event
   that resets loaded flags so partials reload on return.

2. Orphaned intervals: Auto-refresh timers continued running after
   leaving the tab. Now each partial (metrics, tools, prompts, resources)
   listens for `observability:leave` and calls cleanup() to stop intervals.

The cleanup lifecycle is now:
- Tab leave: dispatches observability:leave event
- Partials: listen for event, call cleanup() to stop intervals
- Parent: resets loaded flags so partials reload on next visit
- Event handlers are properly removed in cleanup() to avoid leaks

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Add email_team relationship and team property to Gateway model, mirroring
the optimization applied to the Server model in PRs #1962 and #1975.

Changes:
- Add email_team relationship to Gateway model (db.py)
- Add team property for convenient access to team name
- Update gateway_service.py to use joinedload(DbGateway.email_team)
- Remove _get_team_name helper method (no longer needed)
- Update tests to mock db.execute instead of db.get

This reduces database queries by ~50% for gateway operations that need
team information, eliminating the N+1 query pattern.

Closes #1994

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
… resources (#2211)

When a gateway is activated or deactivated, the system now updates the
states of all associated prompts and resources in addition to tools.
This ensures consistent behavior across all gateway-related entities.

Changes:
- Add PromptService and ResourceService to GatewayService for state sync
- Update prompts and resources state when gateway state changes
- Add skip_cache_invalidation parameter to set_prompt_state and
  set_resource_state for batch operation optimization (consistent with
  tool_service implementation)
- Skip prompt/resource DB queries when only_update_reachable=True to
  avoid unnecessary DB work during health checks
- Fix admin UI to use 'enabled' property instead of 'isActive' for
  gateway status display
- Add unit tests for prompt and resource state synchronization
- Add unit test verifying prompts/resources are skipped for reachability-only updates


Closes #2212

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Fixes pagination controls to properly namespace and clean up query
parameters when navigating between different tables (tools, prompts,
resources) in the admin UI.

Problem: Pagination parameters were accumulating across tabs, resulting
in messy URLs like ?tools_page=3&prompts_page=2&resources_page=1

Solution: Each tab now maintains its own clean URL state. When switching
tabs, URL params are cleaned to only include the current tab's params
plus global params (team_id).

Changes:
- Fixed updateBrowserUrl() in pagination_controls.html to preserve
  existing params while updating current table's params
- Fixed updateInactiveUrlState() in admin.html to preserve params
  from other tables when toggling inactive filter
- Added getTableNamesForTab() for dynamic table detection via DOM
- Added cleanUpUrlParamsForTab() to clean stale params on tab switch

Note: Tab pagination state is not preserved across tab switches.
When returning to a tab, pagination starts fresh at page 1.

Closes #2213

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Chris PC <chrispc@li-4dc2bf4c-325d-11b2-a85c-b68e8b1fc307.ibm.com>
Co-authored-by: Chris PC <chrispc@li-4dc2bf4c-325d-11b2-a85c-b68e8b1fc307.ibm.com>
* fix tags in mcp servers

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>

* fix(tags): handle dict-format tags in filtering and read paths

Address issues with dict-format tag storage introduced by migration:

- Add json_contains_tag_expr() to handle both List[str] and List[Dict]
  tag formats in database queries (SQLite, PostgreSQL, MySQL)
- Update all services (gateway, tool, server, prompt, resource, a2a)
  to use json_contains_tag_expr for tag filtering
- Fix catalog_service.py to pass through dict-format tags without
  re-validation
- Restore legacy tag handling in _prepare_gateway_for_read for
  backward compatibility
- Update tests to mock json_contains_tag_expr instead of json_contains_expr
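
As an in-memory analogue of the dual-format handling described above, the sketch below matches tags stored either as plain strings or as dicts. It is not the `json_contains_tag_expr` implementation (which builds database expressions); the helper names and the assumed dict shape (`id` with a `label` fallback) are hypothetical.

```python
from typing import Any, Dict, List, Union

Tag = Union[str, Dict[str, Any]]


def tag_id(tag: Tag) -> str:
    """Return the comparable identifier of a tag in either storage format."""
    if isinstance(tag, dict):
        # Assumed dict shape: {"id": ..., "label": ...}; fall back to label.
        return str(tag.get("id") or tag.get("label") or "")
    return str(tag)


def has_any_tag(stored: List[Tag], wanted: List[str]) -> bool:
    """True if any stored tag (string or dict) matches one of the wanted tags."""
    stored_ids = {tag_id(t) for t in stored}
    return any(w in stored_ids for w in wanted)
```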

Closes #2203

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…#2253)

This commit fixes a bug where deactivating a virtual server associated with an email notification team would fail with a database error.

The root cause was the use of `joinedload` to fetch the `email_team` relationship when retrieving the server to be updated. This resulted in a `LEFT OUTER JOIN` in the underlying SQL query. When the query also included a
`FOR UPDATE` clause to lock the server row, PostgreSQL raised a `FeatureNotSupported` error because it cannot apply a lock to the nullable side of an outer join.

This fix changes the SQLAlchemy loading strategy from `joinedload` to `selectinload` for the `DbServer.email_team` relationship within the `set_server_state` method. `selectinload` resolves the issue by loading the related
`email_team` in a separate `SELECT` statement, thus avoiding the problematic `JOIN` in the initial `SELECT ... FOR UPDATE` query.

Additionally, comprehensive unit tests have been added for both the `ServerService` and the admin panel routes to cover server activation, deactivation, and to verify that the fix works as expected, preventing future
regressions.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
* tag view

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* fix: Handle object tags in token details and improve fallback handling

- Add object-to-string conversion for tags in showTokenDetailsModal
  (was missed in original PR)
- Remove inconsistent blank line in viewGateway tag rendering
- Add JSON.stringify fallback for malformed tag objects without id/label
  (defense-in-depth for edge cases like DB corruption)

Part of the fix for #2267

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>

---------

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Co-authored-by: Mihai Criveti <crmihai1@ie.ibm.com>
* Fix bump2version

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* bump2version

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* 1.0.0-BETA-2

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: optimize CPU usage in request logging middleware

Add configuration options to reduce CPU and database overhead in
detailed request logging:

- log_detailed_skip_endpoints: List of path prefixes to skip from
  detailed logging (e.g., high-volume or low-value endpoints)
- log_resolve_user_identity: Gate DB fallback for user identity
  resolution behind opt-in flag (default: false)
- log_detailed_sample_rate: Sampling rate (0.0-1.0) to log only a
  fraction of requests when detailed logging is enabled

These optimizations avoid expensive JSON parsing, masking, and identity
lookups unless detailed logging is explicitly enabled and required.
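
A minimal sketch of how the skip-prefix and sampling options could gate detailed logging. The function name `should_log_detailed` and the injectable `rand` parameter are assumptions for illustration; the actual middleware may order or combine these checks differently.

```python
import random
from typing import Callable, Sequence


def should_log_detailed(
    path: str,
    skip_prefixes: Sequence[str],
    sample_rate: float,
    rand: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a request gets detailed logging.

    Skipped path prefixes are checked first, so no random draw is made for them.
    sample_rate is a fraction in [0.0, 1.0].
    """
    if any(path.startswith(prefix) for prefix in skip_prefixes):
        return False
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    return rand() < sample_rate
```

Running the cheap prefix check before anything else means the expensive work (JSON parsing, masking, identity lookup) is never started for skipped or unsampled requests.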

Closes #1865

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add documentation for logging CPU optimization options

Document the new logging configuration options:
- LOG_DETAILED_SKIP_ENDPOINTS: path prefixes to skip from logging
- LOG_DETAILED_SAMPLE_RATE: sampling rate for detailed logging
- LOG_RESOLVE_USER_IDENTITY: opt-in DB lookup for user identity

Updated:
- .env.example with new options and descriptions
- README.md logging table and examples
- Helm chart values.yaml and values.schema.json
- charts/mcp-stack/README.md values table
- docs/config.schema.json (regenerated from Pydantic model)
- docs/docs/config.schema.json (regenerated from Pydantic model)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: add LOG_DETAILED_SKIP_ENDPOINTS to env list normalizer and add tests

- Add LOG_DETAILED_SKIP_ENDPOINTS to _normalize_env_list_vars() to
  support CSV format and empty string values from environment variables
- Add unit tests for skip endpoints, sampling rate, and user identity
  resolution gating in request logging middleware
- Add settings field validation tests for new config options
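
The CSV and empty-string handling described above might look roughly like the sketch below. The name `parse_env_list` is hypothetical, and accepting a JSON array in addition to CSV is an assumption of this sketch rather than something this commit states.

```python
import json
from typing import List


def parse_env_list(raw: str) -> List[str]:
    """Normalize an environment value into a list of non-empty strings.

    Accepts an empty string, a CSV string, or (assumed) a JSON array.
    """
    text = raw.strip()
    if not text:
        return []
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            parsed = None  # not valid JSON; fall through to CSV parsing
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed if str(item).strip()]
    return [part.strip() for part in text.split(",") if part.strip()]
```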

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* doctest coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: lower doctest coverage threshold to 34%

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…patibility (#2342)

- Use jsonschema.validators.validator_for to detect schema draft automatically
- Support multiple JSON Schema drafts (Draft 4, 6, 7, 2019-09, 2020-12)
- Log warnings for unsupported drafts or invalid schemas instead of raising errors
- Handle None schemas gracefully
- Apply consistent validation behavior to both tool and prompt schemas
- Add comprehensive tests for different schema drafts
- Add fallback validator logic in tool_service.py for runtime validation
- Disable MCP SDK's built-in input validation which uses strict Draft 2020-12

Closes #2322

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix(db): Guard against inactive transaction during async cleanup

When registering MCP servers with long initialization times (like Moody's),
a CancelledError can occur during the MCP session teardown (DELETE request).
This causes the database transaction to become inactive before get_db()
attempts to commit, resulting in:

  sqlalchemy.exc.InvalidRequestError: This transaction is inactive

Add db.is_active checks before commit() and rollback() to handle cases
where the transaction becomes inactive during async context manager cleanup.

Closes #2341

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Add upsert logic for resources and prompts to prevent unique constraint violations

When re-registering a gateway (e.g., after deletion or crash), orphaned resources
and prompts from previous registrations could cause unique constraint violations
on `(team_id, owner_email, uri)` for resources and `(team_id, owner_email, name)`
for prompts.

This fix adds upsert logic that:
1. Queries for existing resources/prompts matching the unique constraint
2. Updates existing records instead of creating duplicates
3. Creates new records only when no match exists

This handles scenarios like:
- Gateway deletion that didn't properly clean up resources (issue #2341)
- Re-registration of the same MCP server under a new gateway name
- Race conditions during concurrent operations

Closes #2352

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add cleanup script for orphaned resources/prompts/tools

Adds a utility script to identify and remove database records that were
left orphaned due to incomplete gateway deletions (e.g., #2341 crash).

Usage:
  # Dry run (default) - shows what would be deleted
  python scripts/cleanup_orphaned_resources.py

  # Actually delete orphaned records
  python scripts/cleanup_orphaned_resources.py --execute

  # Filter by team or owner
  python scripts/cleanup_orphaned_resources.py --team-id <id> --owner-email <email>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Only upsert orphaned resources/prompts, add tests

Addresses code review findings:

1. HIGH: Only update truly orphaned records (gateway_id IS NULL or points
   to non-existent gateway). Resources belonging to active gateways are
   no longer at risk of being reassigned.

2. MEDIUM: Use per-resource team/owner overrides when building lookup key,
   matching exactly what would be inserted to avoid constraint mismatches.

3. LOW: Added tests for orphaned resource upsert logic:
   - test_register_gateway_updates_orphaned_resources
   - test_register_gateway_does_not_update_active_gateway_resources
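
The orphan-only upsert rule from the findings above can be illustrated with an in-memory store keyed by the unique constraint `(team_id, owner_email, uri)`. The `Resource` dataclass, the `upsert_resource` function, and the "skipped" outcome for records owned by an active gateway are assumptions for illustration; the real service works against the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Key = Tuple[Optional[str], Optional[str], str]


@dataclass
class Resource:
    team_id: Optional[str]
    owner_email: Optional[str]
    uri: str
    gateway_id: Optional[str]
    content: str = ""


def upsert_resource(store: Dict[Key, Resource], active_gateways: Set[str], incoming: Resource) -> str:
    """Create, update, or skip a resource; return which action was taken."""
    # The lookup key uses the incoming record's own team/owner values,
    # matching exactly what an insert would write.
    key = (incoming.team_id, incoming.owner_email, incoming.uri)
    existing = store.get(key)
    if existing is None:
        store[key] = incoming
        return "created"
    # Orphaned: no gateway at all, or a gateway that no longer exists.
    orphaned = existing.gateway_id is None or existing.gateway_id not in active_gateways
    if not orphaned:
        return "skipped"  # never reassign a record owned by an active gateway
    existing.gateway_id = incoming.gateway_id
    existing.content = incoming.content
    return "updated"
```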

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Always call rollback() in exception handler, improve tests

Addresses remaining code review findings:

1. ROLLBACK GUARD FIX:
   - REMOVED the `if db.is_active:` check before `db.rollback()`
   - Empirical testing proved that:
     * After IntegrityError, is_active becomes False
     * rollback() when is_active=False SUCCEEDS (doesn't fail!)
     * rollback() restores is_active to True, cleaning up the session
     * Skipping rollback when is_active=False leaves session unusable
   - The is_active guard for commit() is CORRECT (commit fails when False)
   - The is_active guard for rollback() was WRONG (rollback is always safe)

2. TEST ASSERTIONS:
   - Rewrote orphaned resource tests with proper assertions
   - Tests now directly verify:
     * Orphaned resources are detected and added to map
     * Resource fields are actually updated during upsert
     * Resources with deleted gateways are detected as orphaned
     * Resources with active gateways are NOT touched
     * Per-resource owner/team overrides are used in lookup key
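
The asymmetric guard described in item 1 (check `is_active` before commit, always roll back on error) can be sketched with a stand-in session. `FakeSession` only mimics the behavior reported in this commit message and is not the SQLAlchemy API; `run_in_session` is a hypothetical helper, not the project's `get_db()`.

```python
class FakeSession:
    """Stand-in session that mimics the behavior described above."""

    def __init__(self) -> None:
        self.is_active = True
        self.calls = []

    def commit(self) -> None:
        if not self.is_active:
            raise RuntimeError("This transaction is inactive")
        self.calls.append("commit")

    def rollback(self) -> None:
        # Rollback succeeds even when inactive and restores the session.
        self.is_active = True
        self.calls.append("rollback")

    def close(self) -> None:
        self.calls.append("close")


def run_in_session(session, work) -> None:
    """Run work(session), committing only if the transaction is still active."""
    try:
        work(session)
        if session.is_active:  # guard commit: it fails on an inactive transaction
            session.commit()
    except Exception:
        session.rollback()  # no guard: rollback is safe and cleans up the session
        raise
    finally:
        session.close()
```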

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add file encoding header to cleanup script

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com>
Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com>
Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ps (#2359)

* fix(perf): resolve lock contention and CPU spin loop under high load (#2355)

Phase 1: Replace cascading FOR UPDATE loops with bulk UPDATE statements
in gateway_service.set_gateway_state() to eliminate lock contention when
activating/deactivating gateways with many tools/servers/prompts/resources.

Phase 2: Add nowait=True to get_for_update() calls in set_server_state()
and set_tool_state() to fail fast on locked rows instead of blocking.
Add ServerLockConflictError and ToolLockConflictError exceptions with
409 Conflict handlers in main.py and admin.py routers.

Phase 3: Fix CPU spin loop in SSE transport by properly detecting client
disconnection. Add request.is_disconnected() check, consecutive error
counting, GeneratorExit handling, and ensure _client_gone is set in all
exit paths.

Results:
- RPS improved from 173-600 to ~2000 under load
- Failure rate reduced from 14-22% to 0.03-0.04%
- Blocked queries reduced from 33-48 to 0
- CPU after load test: ~1% (was 800%+ spin loop)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(perf): add database lock timeout configuration (#2355)

Add configurable timeout settings for database operations:
- db_lock_timeout_ms: Maximum wait for row locks (default 5000ms)
- db_statement_timeout_ms: Maximum statement execution (default 30000ms)

These settings can be used with get_for_update() to prevent indefinite
blocking under high concurrency scenarios.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: update tests for bulk UPDATE and SSE transport changes (#2355)

Update gateway_service tests to use side_effect for multiple db.execute
calls (SELECT + bulk UPDATEs) instead of single return_value.

Update row_level_locking test to expect nowait=True parameter in
get_for_update calls for set_tool_state.

Update SSE transport tests to mock request.is_disconnected() and adjust
error handling test to expect consecutive errors causing generator stop
instead of error event emission.

Add missing exception documentation for ServerLockConflictError and
ToolLockConflictError in service docstrings (flake8 DAR401).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add send_timeout to EventSourceResponse to prevent spin loops (#2355)

When Granian ASGI server fails to send to a disconnected client, it logs
"ASGI transport error: SendError" but doesn't raise an exception to our
code. This causes rapid iteration of the generator without proper
timeout handling.

Add send_timeout=5.0 to EventSourceResponse to ensure sends time out if
they fail, triggering sse_starlette's built-in error handling.

Also enable sse_starlette's built-in ping mechanism when keepalive is
enabled, which provides additional disconnect detection.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add rapid yield detection to prevent CPU spin loops (#2355)

When clients disconnect abruptly, Granian may fail sends without
raising Python exceptions. This adds rapid yield detection: if
50+ yields occur within 1 second, we assume client is disconnected
and stop the generator.

New configurable settings:
- SSE_SEND_TIMEOUT: ASGI send timeout (default 30s)
- SSE_RAPID_YIELD_WINDOW_MS: detection window (default 1000ms)
- SSE_RAPID_YIELD_MAX: max yields before disconnect (default 50)
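
A sliding-window version of the rapid yield check could look like the sketch below, using the defaults quoted above (50 yields, 1000 ms). The class name `RapidYieldDetector` and its interface are assumptions; timestamps are passed in explicitly so the logic can be exercised without real waiting.

```python
from collections import deque
from typing import Deque


class RapidYieldDetector:
    """Flag a stream when too many yields land inside a short time window."""

    def __init__(self, max_yields: int = 50, window_ms: int = 1000) -> None:
        self.max_yields = max_yields
        self.window_s = window_ms / 1000.0
        self._stamps: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record a yield at time `now` (seconds); return True if it looks like a spin loop."""
        self._stamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        return len(self._stamps) >= self.max_yields
```

In a generator, `record(time.monotonic())` would be called after each yield, and the generator would stop when it returns True.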

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): log rapid yield detection at ERROR level for visibility

Changed from WARNING to ERROR so the detection message is visible
even when LOG_LEVEL=ERROR. This is appropriate since rapid yield
detection indicates a problem condition (client disconnect not
reported by ASGI).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review findings from ChatGPT analysis

Finding 1: Fix lock conflict error propagation
- Add explicit except handlers for ToolLockConflictError and
  ServerLockConflictError before the generic Exception handler
- This allows 409 responses to propagate correctly instead of
  being wrapped as generic 400 errors

Finding 3: Improve SSE rapid yield detection
- Only track message yields, not keepalives
- Reset the timestamp deque when timeout occurs (we actually waited)
- This prevents false positives on high-throughput legitimate streams

Finding 4: Remove unused db timeout settings
- Remove db_lock_timeout_ms and db_statement_timeout_ms from config
- These settings were defined but never wired into DB operations
- Avoids false sense of protection

Finding 2 (notifications) is intentional: gateway-level notifications
are sent, and bulk UPDATE is used for performance under high load.
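
Finding 1 comes down to handler ordering: Python uses the first matching `except` clause, so the specific lock error must be listed before the generic handler. The sketch below uses a local stand-in for `ToolLockConflictError` and a hypothetical `call_with_status` helper to show the effect on status codes.

```python
from typing import Callable, Tuple


class ToolLockConflictError(Exception):
    """Local stand-in for the service's lock conflict error."""


def call_with_status(operation: Callable[[], str]) -> Tuple[int, str]:
    """Run an operation and map its outcome to an HTTP-style status code."""
    try:
        return 200, operation()
    except ToolLockConflictError as exc:
        # Must come first; otherwise the generic handler below swallows it.
        return 409, str(exc)
    except Exception as exc:
        return 400, str(exc)
```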

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(perf): add nowait locks to prompt and resource state changes

Extends the lock contention fix to prompt_service and resource_service:
- Add PromptLockConflictError and ResourceLockConflictError classes
- Use nowait=True in get_for_update to fail fast if row is locked
- Add 409 Conflict handlers in main.py for both services
- Re-raise specific errors before generic Exception handler

This ensures consistent lock handling across all state change endpoints.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): improve rapid yield detection to catch all spin scenarios

- Track time since last yield as additional signal (<10ms is suspicious)
- Check rapid yield after BOTH message and keepalive yields
- Reset timestamps only after successful keepalive wait
- Include time interval in error log for debugging

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add robust spin loop detection and update dependencies (#2355)

SSE transport improvements:
- Add consecutive rapid yield counter for simpler spin loop detection
  (triggers after 10 yields < 100ms apart)
- Remove deque clearing after keepalives that prevented detection
- Add client_close_handler_callable to detect disconnects that ASGI
  servers like granian may not propagate via request.is_disconnected()
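
The consecutive counter described above (10 yields less than 100 ms apart) is simpler than a sliding window because it keeps only the last timestamp and a streak count. A sketch under assumed names (`ConsecutiveRapidYieldCounter`, `record`):

```python
from typing import Optional


class ConsecutiveRapidYieldCounter:
    """Trigger after `limit` consecutive yields spaced less than `min_gap_s` apart."""

    def __init__(self, limit: int = 10, min_gap_s: float = 0.1) -> None:
        self.limit = limit
        self.min_gap_s = min_gap_s
        self._last: Optional[float] = None
        self._streak = 0

    def record(self, now: float) -> bool:
        """Record a yield at time `now` (seconds); return True once the streak hits the limit."""
        if self._last is not None and now - self._last < self.min_gap_s:
            self._streak += 1
        else:
            self._streak = 0  # a normally spaced yield resets the streak
        self._last = now
        return self._streak >= self.limit
```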

Test updates:
- Update row-level locking tests to expect nowait=True for prompt
  and resource state changes

Dependency updates:
- Containerfile.lite: Update UBI base images to latest
- gunicorn 23.0.0 -> 24.1.1
- sqlalchemy 2.0.45 -> 2.0.46
- langgraph 1.0.6 -> 1.0.7
- hypothesis 6.150.2 -> 6.150.3
- schemathesis 4.9.2 -> 4.9.4
- copier 9.11.1 -> 9.11.3
- pytest-html 4.1.1 -> 4.2.0

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add granian worker lifecycle options for SSE connection leak workaround

Document GRANIAN_WORKERS_LIFETIME and GRANIAN_WORKERS_MAX_RSS options
as commented-out configuration in docker-compose.yml and run-granian.sh.

These options provide a workaround for granian issue #286 where SSE
connections are not properly closed after client disconnect, causing
CPU spin loops after load tests complete.

Refs: #2357, #2358
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Update pyproject.toml

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* lint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…2363)

The Export Config button was not included when pagination was
implemented in PR #1955. This button allows users to export
MCP client configuration in stdio/SSE/HTTP formats.

Closes #2362

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: add sso_entra_admin_groups to list field validator

- sso_entra_admin_groups now properly parses CSV/JSON from environment
- Closes #2265

Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>

* style: fix missing blank lines in test_config.py

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…scanners (#2200)

- Convert cache.py to use async redis (redis.asyncio) for non-blocking I/O
- Add parallel scanner execution using asyncio.gather in input/output filters
- Add asyncio.to_thread for CPU-bound scanner operations
- Quiet llm_guard logger to ERROR level to reduce noise
- Fix tests to use prompt_id instead of deprecated name parameter
- Update test to use environment variables for redis host/port

Security: Scanner errors now fail-closed (is_valid=False) instead of being
skipped, ensuring policy evaluation denies requests when scanners fail.
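
A sketch of the parallel, fail-closed pattern described above, using `asyncio.gather` with `asyncio.to_thread`. The scanner signature (`str -> bool`) and the names `scan_all` and `_run_fail_closed` are assumptions; the plugin's real scanners return richer results.

```python
import asyncio
from typing import Callable, Dict, Tuple

Scanner = Callable[[str], bool]


def _run_fail_closed(scanner: Scanner, text: str) -> bool:
    """Run one scanner; any exception counts as a failed (invalid) result."""
    try:
        return bool(scanner(text))
    except Exception:
        return False


async def scan_all(text: str, scanners: Dict[str, Scanner]) -> Tuple[bool, Dict[str, bool]]:
    """Run all scanners concurrently in worker threads and combine the verdicts."""
    names = list(scanners)
    results = await asyncio.gather(
        *(asyncio.to_thread(_run_fail_closed, scanners[name], text) for name in names)
    )
    verdicts = dict(zip(names, results))
    return all(verdicts.values()), verdicts
```

Treating a scanner exception as `False` rather than dropping it from the result is what makes the overall verdict deny the request when a scanner breaks.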

Closes #1959

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
1. Replace custom CSS classes with native Tailwind utility classes.
2. Add Chart.js theming for dark-mode graphs.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: plugin template test cases.

Signed-off-by: Teryl Taylor <terylt@ibm.com>

* fix: context passing in unit test

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>

---------

Signed-off-by: Teryl Taylor <terylt@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Co-authored-by: Teryl Taylor <terylt@ibm.com>
Co-authored-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Precompile all regex patterns at module or configuration initialization
time across 14 plugins, eliminating per-request compilation overhead.
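
The two precompilation points mentioned above (module level and configuration initialization) can be sketched as follows. The email pattern, `mask_emails`, and `DenyListFilter` are hypothetical examples, not code from any of the 14 plugins.

```python
import re
from typing import List

# Module-level pattern: compiled once at import time.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str) -> str:
    """Replace email-like substrings using the precompiled pattern."""
    return _EMAIL_RE.sub("[email]", text)


class DenyListFilter:
    """Config-driven patterns: compiled once when the filter is constructed."""

    def __init__(self, patterns: List[str]) -> None:
        self._compiled = [re.compile(p, re.IGNORECASE) for p in patterns]

    def is_blocked(self, text: str) -> bool:
        # The per-request path only runs searches; nothing is compiled here.
        return any(p.search(text) for p in self._compiled)
```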

Closes #1834

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
crivetimihai and others added 13 commits February 14, 2026 19:56
* chore: add linting-full workflow and pre-commit gate

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: bootstrap pre-commit with uv in pipless venvs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: exclude known pre-commit false positives

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: support helm plugin verify toggle in linting target

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: resolve gosec findings and enforce linting gosec gate

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ult for development testing (#2949)

* feat(keycloak): add Keycloak SSO to docker-compose for development testing

- Add Keycloak service to docker-compose.yml (pinned to v26.1) with
  pre-configured realm export for local SSO development
- Add docker-compose.sso.yml overlay for SSO-specific configuration
- Add "sso" to testing and inspector service profiles so --profile sso
  brings up the full dev stack (Keycloak + testing + inspector)
- Implement SSO bootstrap utility for automatic Keycloak realm/client setup
- Add Keycloak discovery helper with well-known endpoint support
- Enhance SSO service with id_token claims fallback for split-host
  configurations (restricted to 401 + split-host detection only)
- Improve admin.py SSO logout with proper id_token_hint RP-initiated flow
- Skip expired id_token_hint in logout URL to avoid Keycloak rejection
- Add cookie size validation for id_token storage (>3.8KB warning)
- Use dynamic max_age for SSO cookies matching token_expiry setting
- Improve error handling and logging throughout SSO flows
- Add sso-keycloak-tutorial and developer workstation documentation
- Add test-sso-flow.sh script for manual SSO flow verification
- Add Makefile targets for Keycloak lifecycle management

Closes #2949

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test(keycloak): add and update tests for Keycloak SSO integration

- Add SSO service tests for Keycloak userinfo fallback, callback error
  handling, personal team resolution, and user authentication flows
- Add Keycloak discovery and SSO bootstrap unit tests
- Update SSO router tests to use handle_oauth_callback_with_tokens
- Update admin module tests for Keycloak-aware logout behavior with
  separate coverage for keycloak-enabled and keycloak-disabled paths
- Add expired id_token_hint omission test
- Add id_token cookie size and max_age validation tests

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: add original_description column to tools table (#2893)

When tools are registered, preserve the original description from the
source MCP server in a new original_description field. This allows users
to customize the tool description while retaining the original for
reference.

- Add original_description column to Tool model and ToolRead schema
- Populate original_description at tool registration time
- Include original_description in export service output
- Add Alembic migration with data backfill from existing descriptions

Closes #2893

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: preserve custom descriptions during gateway refresh and harden original_description

- Protect user-customized descriptions during gateway sync/refresh by
  only overwriting description when it matches original_description
- Add default value (None) to ToolRead.original_description for cache
  compatibility across deployments
- Add original_description to tool cache payload for consistency
- Restore original_description during export/import cycle
- Update tests to cover description preservation behavior
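
The preservation rule above (overwrite description only when it still matches original_description) can be sketched as follows. `ToolRecord` and `apply_refresh` are hypothetical stand-ins for the real Tool model and gateway service code. Updating original_description to the new upstream value on every refresh is an assumption of this sketch, not something the commit message states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolRecord:
    """Hypothetical, simplified stand-in for the Tool model."""

    name: str
    description: Optional[str]
    original_description: Optional[str] = None


def apply_refresh(tool: ToolRecord, upstream_description: Optional[str]) -> ToolRecord:
    """Apply an upstream description without clobbering a user-customized one."""
    # The description counts as customized once it differs from the stored original.
    customized = tool.description != tool.original_description
    if not customized:
        tool.description = upstream_description
    # Assumption: the original always tracks what the source MCP server reports.
    tool.original_description = upstream_description
    return tool
```

With this rule, a tool whose description was never edited follows the upstream text, while an edited description survives any number of refreshes.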

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: scope import original_description restore by batch ID and original_name

- Use Tool.original_name (not computed Tool.name) for restore lookup
- Generate import_batch_id to scope restore to newly created tools only
- Skip restore when no tools were created (respects skip/fail semantics)
- Add tests for restore-by-batch and skip-no-restore behaviors
- Fix missing original_description on gateway service test mock

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: commit (not flush) import original_description restore

db.flush() without a subsequent commit loses the restore changes
since register_tools_bulk already committed its own transaction.
Verified with E2E testing against running docker-compose cluster.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Nithin Katta <Nithin.Katta@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…tting (#2898)

Enforce the existing REQUIRE_TOKEN_EXPIRATION config setting at
token creation time. Previously the setting only validated incoming
tokens but allowed creation of tokens without expiration.

- Add validation in TokenCatalogService.create_token() to reject
  tokens without expiration when REQUIRE_TOKEN_EXPIRATION=true
- Pass require_token_expiration flag to admin UI template context
- Add conditional required field indicator and helper text in the
  token creation form
- Update existing tests to provide expires_in_days where needed
- Add new test cases covering policy enabled/disabled scenarios,
  team tokens, and edge cases (zero expiry days)
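
A minimal sketch of the creation-time check described above. The function name `validate_token_expiry` is hypothetical, and the real logic lives in TokenCatalogService.create_token(). The sketch only models the "no expiration supplied" case. How zero expiry days is treated is not stated here, so it is deliberately left out.

```python
from typing import Optional


def validate_token_expiry(expires_in_days: Optional[int], require_token_expiration: bool) -> None:
    """Reject a token request with no expiration when the policy requires one."""
    if require_token_expiration and expires_in_days is None:
        raise ValueError(
            "Token expiration is required by policy (REQUIRE_TOKEN_EXPIRATION=true)"
        )
```

Calling this before persisting the token means the policy is enforced at creation, not only when an incoming token is validated.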

Closes #2836

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(auth): implement password reset and recovery workflows

Add self-service forgot/reset password APIs and admin UI flows with one-time reset tokens, SMTP notifications, account unlock actions, lockout expiry fixes, metrics, migration, docs, and recovery tooling.

Closes #2542

Closes #2543

Closes #2628

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* lint fixes

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Test coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix password reset issues

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Rebase

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: mark hash_password utility executable

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
#2950)

* fix(ui): standardize loading indicators across admin pages (#2946)

Signed-off-by: Oriol Morros Vilaseca <OM368@student.aru.ac.uk>

* fix(ui): standardize Users loading indicator to match pattern

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Oriol Morros Vilaseca <OM368@student.aru.ac.uk>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…2865)

* feat: add slow-time-server for timeout, resilience, and load testing

Implements #2783. A configurable-latency Go MCP server modelled on
fast-time-server that introduces artificial delays on every tool call,
serving as a testing target for gateway timeout enforcement, circuit
breaker behaviour, session pool resilience, and load testing.

Server features:
- 5 MCP tools: get_slow_time, convert_slow_time, get_instant_time,
  get_timeout_time, get_flaky_time
- 2 MCP resources: latency://config, latency://stats
- 1 MCP prompt: test_timeout
- 4 latency distributions: fixed, uniform, normal, exponential
- Failure simulation with configurable rate and mode
- Runtime reconfiguration via REST POST /api/v1/config
- Invocation statistics with p50/p95/p99 percentiles
- Multi-transport: stdio, SSE, Streamable HTTP, dual, REST
- 32 unit tests with race detection, all passing

Integration:
- docker-compose.yml: testing profile (port 8889) + auto-registration
- docker-compose-performance.yml: dedicated performance testing service
- Locust load test with 4 scenarios (slow, timeout storm, mixed, circuit breaker)
- Documentation in docs/docs/using/servers/go/slow-time-server.md

Closes #2783

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review issues in slow-time-server PR

- Fix healthcheck in docker-compose-performance.yml to use binary's
  built-in health check instead of curl (scratch image has no curl)
- Replace mixed atomic+mutex with plain increments under mutex in
  invocationStats.record() for clarity
- Remove dead generateTestTimeoutPrompt() function and unused strings
  import from rest_handlers.go
- Remove unused json and uuid imports from locust test file
- Align env var naming (SLOW_TIME_LATENCY) between compose files

Closes #2783

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: move slow-time-server to dedicated resilience profile

The slow-time-server deliberately introduces latency and failures,
which breaks existing tests when included in the testing profile.
Move it to a dedicated 'resilience' profile so it must be explicitly
opted into.

- docker-compose.yml: profiles ["testing"] -> ["resilience"]
- Makefile: add resilience-up/down/logs targets
- docs: update profile references

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat: add Makefile targets for resilience load testing

Add dedicated targets for running Locust and JMeter tests against
the slow-time-server, ensuring these tests only run when explicitly
invoked rather than as part of the regular testing profile.

New targets:
- resilience-locust: headless Locust run (10 users, 120s)
- resilience-locust-ui: Locust web UI on port 8090
- resilience-jmeter: JMeter baseline (20 threads, 5min)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…#2943)

* fix: add alembic upgrade validation and migration compatibility fixes

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(db): handle psycopg JSON deserialization in alembic migrations

Closes #2955

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* format

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ce suite (#2956)

* feat: implement MCP 2025-11-25 compliance suite and dated make targets

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* format

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: exclude compliance tests from default pytest and resolve gosec findings

- Add --ignore=tests/compliance to pytest addopts so compliance suite
  only runs via dedicated make targets (make 2025-11-25, etc.)
- Lazy-import mcpgateway.main in compliance conftest to avoid triggering
  bootstrap_db during test collection
- Fix gosec G114: replace bare http.ListenAndServe with http.Server
  using ReadHeaderTimeout in slow-time-server (4 instances)
- Suppress gosec G404/G705 false positives with nosec annotations

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…2724)

- Add AlreadyEncryptedError and NotEncryptedError (extend ValueError)
  for explicit validation in strict mode
- Introduce v2: format prefix for unambiguous encrypted data detection
- Add strict vs idempotent API modes (decrypt_secret vs
  decrypt_secret_or_plaintext) with backward-compatible async wrappers
- Replace length heuristic in oauth_manager with explicit is_encrypted()
- Add null checks after decryption in dcr_service update/delete
- Migrate encryption tests to dedicated test_encryption_service.py
- Add comprehensive test coverage for edge cases, concurrent operations,
  and real-world token formats (JWT, OAuth2, API keys)
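
The strict and idempotent modes above can be sketched as below. The names is_encrypted, decrypt_secret, decrypt_secret_or_plaintext, AlreadyEncryptedError and NotEncryptedError come from the list above, while `encrypt_secret` and the function signatures are assumptions. Base64 is used purely as a placeholder transform so the example runs without extra dependencies. It is NOT encryption and does not reflect the real cipher.

```python
import base64

PREFIX = "v2:"  # format prefix that marks a value as encrypted


class AlreadyEncryptedError(ValueError):
    """Raised in strict mode when asked to encrypt an already-encrypted value."""


class NotEncryptedError(ValueError):
    """Raised in strict mode when asked to decrypt a plaintext value."""


def is_encrypted(value: str) -> bool:
    # An explicit prefix check replaces any length-based heuristic.
    return value.startswith(PREFIX)


def encrypt_secret(plaintext: str) -> str:
    if is_encrypted(plaintext):
        raise AlreadyEncryptedError("value already carries the v2: prefix")
    # Placeholder transform only: base64 is NOT encryption.
    return PREFIX + base64.urlsafe_b64encode(plaintext.encode()).decode()


def decrypt_secret(value: str) -> str:
    """Strict mode: the input must be encrypted."""
    if not is_encrypted(value):
        raise NotEncryptedError("value does not carry the v2: prefix")
    return base64.urlsafe_b64decode(value[len(PREFIX):]).decode()


def decrypt_secret_or_plaintext(value: str) -> str:
    """Idempotent mode: plaintext passes through unchanged."""
    return decrypt_secret(value) if is_encrypted(value) else value
```

Subclassing ValueError lets existing callers that catch ValueError keep working, while new callers can catch the specific error.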

Closes #2405

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Implement LDAP bind authentication, directory sync, and Docker
infrastructure for local development/testing with OpenLDAP.

- LDAP simple bind login (service account search + user bind)
- Auto-provisioning of gateway users from LDAP entries
- LDAP group-to-role mapping with configurable role_mappings
- Periodic background directory sync (users + groups → teams)
- Orphan user removal on sync (opt-in via LDAP_SYNC_DELETE_ORPHANS)
- Docker Compose profile with OpenLDAP + phpLDAPadmin + seed data
- Makefile targets: compose-ldap, compose-ldap-seed, compose-ldap-down, compose-ldap-clean
- 29 configuration settings via environment variables
- Security hardening: StartTLS before bind, account lockout,
  auth provider isolation, role refresh on login, admin-only status endpoint
- 70 unit tests (service + router)
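
The group-to-role mapping listed above can be illustrated with a small, dependency-free sketch. The function name `roles_for_groups`, the `default_role` fallback, and the case-insensitive DN comparison are all assumptions made for illustration. The actual behaviour is defined in mcpgateway/services/ldap_service.py.

```python
from typing import Dict, Iterable, List


def roles_for_groups(
    group_dns: Iterable[str],
    role_mappings: Dict[str, str],
    default_role: str = "viewer",
) -> List[str]:
    """Resolve gateway roles from a user's LDAP group DNs."""
    # Simplification: compare DNs case-insensitively after trimming whitespace.
    normalized = {dn.strip().lower(): role for dn, role in role_mappings.items()}
    roles: List[str] = []
    for dn in group_dns:
        role = normalized.get(dn.strip().lower())
        if role and role not in roles:
            roles.append(role)
    # Hypothetical fallback: users with no mapped group receive the default role.
    return roles or [default_role]
```

Recomputing the roles from the directory on each login is one way to implement the "role refresh on login" item, since a group removal in LDAP then takes effect at the next sign-in.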

Closes #284

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai self-assigned this Feb 16, 2026
@crivetimihai crivetimihai added the experimental Experimental features, test proposed MCP Specification changes label Feb 16, 2026
@crivetimihai crivetimihai added this to the Release 1.0.0-GA milestone Feb 16, 2026
@crivetimihai
Member Author

Multi-domain LDAP analysis (ref #1264)

Issue #1264 requests LDAP integration with multiple Active Directory domains (e.g. corp.example.com + emea.example.com). This PR implements single-domain LDAP only. Here's the gap analysis:

Current state: single-domain

Every LDAP setting is a single scalar value — one URI, one base DN, one service account, one search scope, one role mapping dict. LdapService reads these directly throughout _build_server(), authenticate(), search_users(), etc. There is no concept of iterating over multiple domain configurations.

What multi-domain would require

| Aspect | Current | Multi-domain need |
| --- | --- | --- |
| Config model | Flat scalar ldap_* settings | List of domain config objects (JSON) |
| LdapService.__init__ | Reads global settings.* | Domain config param or iterates list |
| _build_server() | One server from settings.ldap_uri | Server per domain |
| authenticate() | Searches one base DN | Iterate domains or accept domain hint (EMEA\alice) |
| search_users/groups() | One search scope | Per-domain search, results tagged with source domain |
| sync_directory() | One sync pass | Per-domain sync with namespaced teams (ldap-corp-admins vs ldap-emea-admins) |
| /status endpoint | One connection check | Per-domain health matrix |
| Role mappings | One flat dict | Per-domain mappings (groups may share names across domains) |
| Email fallback | Uses single ldap_base_dn | Must use matching domain's base DN |
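
Two of the rows above, the domain hint in authenticate() and the namespaced teams in sync_directory(), can be sketched in a few lines. Both helpers (`split_domain_hint`, `namespaced_team`) are hypothetical and nothing like them exists in this PR. Treating everything after the last `@` as a domain hint is an assumption that may clash with plain email logins.

```python
from typing import Optional, Tuple


def split_domain_hint(login: str) -> Tuple[Optional[str], str]:
    """Split 'DOMAIN\\user' or 'user@domain' into (domain_hint, username)."""
    if "\\" in login:
        domain, _, user = login.partition("\\")
        return (domain.lower() or None, user)
    if "@" in login:
        user, _, domain = login.rpartition("@")
        return (domain.lower() or None, user)
    # No hint: the caller would fall back to iterating all configured domains.
    return (None, login)


def namespaced_team(domain_key: str, group_cn: str) -> str:
    """Build a team name that cannot collide across domains."""
    return f"ldap-{domain_key}-{group_cn}".lower()
```

For example, `namespaced_team("corp", "admins")` and `namespaced_team("emea", "admins")` yield the two distinct team names shown in the table.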

Recommended approach for #1264

  1. Introduce an LdapDomainConfig dataclass holding per-domain settings
  2. Add an LDAP_DOMAINS JSON env var (list of domain configs), keeping the existing flat ldap_* env vars as a backward-compatible single-domain shorthand
  3. Refactor LdapService to accept a domain config (or iterate the list)
  4. Domain hint support in login (DOMAIN\user or user@domain)
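
Steps 1 and 2 above might look roughly like the sketch below. The field names on `LdapDomainConfig`, the `load_domains` helper, and the uppercase flat variable names (LDAP_URI, LDAP_BASE_DN, LDAP_BIND_DN) are assumptions for illustration. Only the LdapDomainConfig and LDAP_DOMAINS names come from the proposal itself.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class LdapDomainConfig:
    """Per-domain settings; the field set here is illustrative, not exhaustive."""

    name: str
    uri: str
    base_dn: str
    bind_dn: str = ""
    role_mappings: Dict[str, str] = field(default_factory=dict)


def load_domains(env: Mapping[str, str]) -> List[LdapDomainConfig]:
    """Load domain configs from LDAP_DOMAINS, falling back to flat settings."""
    raw = env.get("LDAP_DOMAINS")
    if raw:
        # Each JSON object must use the dataclass field names as keys.
        return [LdapDomainConfig(**item) for item in json.loads(raw)]
    if env.get("LDAP_URI"):
        # Backward-compatible shorthand: flat settings become one unnamed domain.
        return [
            LdapDomainConfig(
                name="default",
                uri=env["LDAP_URI"],
                base_dn=env.get("LDAP_BASE_DN", ""),
                bind_dn=env.get("LDAP_BIND_DN", ""),
            )
        ]
    return []
```

In practice the function would be called with `os.environ`. Taking a mapping argument keeps it easy to test, and returning a list in both cases lets the service iterate domains without special-casing single-domain deployments.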

Not in scope for this PR — tracked in #1264.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai added enhancement New feature or request COULD P3: Nice-to-have features with minimal impact if left out; included if time permits labels Feb 21, 2026
@crivetimihai
Member Author

Reopened as #3148. CI/CD will re-run on the new PR. You are still credited as the author.


Labels

COULD P3: Nice-to-have features with minimal impact if left out; included if time permits enhancement New feature or request experimental Experimental features, test proposed MCP Specification changes


Development

Successfully merging this pull request may close these issues.

[FEATURE][AUTH]: LDAP / Active Directory integration