refactor(plugins): decouple plugin framework from gateway observability service#2827
refactor(plugins): decouple plugin framework from gateway observability service#2827
Conversation
Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com> Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com> Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ps (#2359) * fix(perf): resolve lock contention and CPU spin loop under high load (#2355) Phase 1: Replace cascading FOR UPDATE loops with bulk UPDATE statements in gateway_service.set_gateway_state() to eliminate lock contention when activating/deactivating gateways with many tools/servers/prompts/resources. Phase 2: Add nowait=True to get_for_update() calls in set_server_state() and set_tool_state() to fail fast on locked rows instead of blocking. Add ServerLockConflictError and ToolLockConflictError exceptions with 409 Conflict handlers in main.py and admin.py routers. Phase 3: Fix CPU spin loop in SSE transport by properly detecting client disconnection. Add request.is_disconnected() check, consecutive error counting, GeneratorExit handling, and ensure _client_gone is set in all exit paths. Results: - RPS improved from 173-600 to ~2000 under load - Failure rate reduced from 14-22% to 0.03-0.04% - Blocked queries reduced from 33-48 to 0 - CPU after load test: ~1% (was 800%+ spin loop) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(perf): add database lock timeout configuration (#2355) Add configurable timeout settings for database operations: - db_lock_timeout_ms: Maximum wait for row locks (default 5000ms) - db_statement_timeout_ms: Maximum statement execution (default 30000ms) These settings can be used with get_for_update() to prevent indefinite blocking under high concurrency scenarios. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: update tests for bulk UPDATE and SSE transport changes (#2355) Update gateway_service tests to use side_effect for multiple db.execute calls (SELECT + bulk UPDATEs) instead of single return_value. Update row_level_locking test to expect nowait=True parameter in get_for_update calls for set_tool_state. Update SSE transport tests to mock request.is_disconnected() and adjust error handling test to expect consecutive errors causing generator stop instead of error event emission. Add missing exception documentation for ServerLockConflictError and ToolLockConflictError in service docstrings (flake8 DAR401). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add send_timeout to EventSourceResponse to prevent spin loops (#2355) When Granian ASGI server fails to send to a disconnected client, it logs "ASGI transport error: SendError" but doesn't raise an exception to our code. This causes rapid iteration of the generator without proper timeout handling. Add send_timeout=5.0 to EventSourceResponse to ensure sends time out if they fail, triggering sse_starlette's built-in error handling. Also enable sse_starlette's built-in ping mechanism when keepalive is enabled, which provides additional disconnect detection. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add rapid yield detection to prevent CPU spin loops (#2355) When clients disconnect abruptly, Granian may fail sends without raising Python exceptions. This adds rapid yield detection: if 50+ yields occur within 1 second, we assume client is disconnected and stop the generator. New configurable settings: - SSE_SEND_TIMEOUT: ASGI send timeout (default 30s) - SSE_RAPID_YIELD_WINDOW_MS: detection window (default 1000ms) - SSE_RAPID_YIELD_MAX: max yields before disconnect (default 50) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): log rapid yield detection at ERROR level for visibility Changed from WARNING to ERROR so the detection message is visible even when LOG_LEVEL=ERROR. This is appropriate since rapid yield detection indicates a problem condition (client disconnect not reported by ASGI). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: address review findings from ChatGPT analysis Finding 1: Fix lock conflict error propagation - Add explicit except handlers for ToolLockConflictError and ServerLockConflictError before the generic Exception handler - This allows 409 responses to propagate correctly instead of being wrapped as generic 400 errors Finding 3: Improve SSE rapid yield detection - Only track message yields, not keepalives - Reset the timestamp deque when timeout occurs (we actually waited) - This prevents false positives on high-throughput legitimate streams Finding 4: Remove unused db timeout settings - Remove db_lock_timeout_ms and db_statement_timeout_ms from config - These settings were defined but never wired into DB operations - Avoids false sense of protection Finding 2 (notifications) is intentional: gateway-level notifications are sent, and bulk UPDATE is used for performance under high load. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(perf): add nowait locks to prompt and resource state changes Extends the lock contention fix to prompt_service and resource_service: - Add PromptLockConflictError and ResourceLockConflictError classes - Use nowait=True in get_for_update to fail fast if row is locked - Add 409 Conflict handlers in main.py for both services - Re-raise specific errors before generic Exception handler This ensures consistent lock handling across all state change endpoints. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): improve rapid yield detection to catch all spin scenarios - Track time since last yield as additional signal (<10ms is suspicious) - Check rapid yield after BOTH message and keepalive yields - Reset timestamps only after successful keepalive wait - Include time interval in error log for debugging Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add robust spin loop detection and update dependencies (#2355) SSE transport improvements: - Add consecutive rapid yield counter for simpler spin loop detection (triggers after 10 yields < 100ms apart) - Remove deque clearing after keepalives that prevented detection - Add client_close_handler_callable to detect disconnects that ASGI servers like granian may not propagate via request.is_disconnected() Test updates: - Update row-level locking tests to expect nowait=True for prompt and resource state changes Dependency updates: - Containerfile.lite: Update UBI base images to latest - gunicorn 23.0.0 -> 24.1.1 - sqlalchemy 2.0.45 -> 2.0.46 - langgraph 1.0.6 -> 1.0.7 - hypothesis 6.150.2 -> 6.150.3 - schemathesis 4.9.2 -> 4.9.4 - copier 9.11.1 -> 9.11.3 - pytest-html 4.1.1 -> 4.2.0 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add granian worker lifecycle options for SSE connection leak workaround Document GRANIAN_WORKERS_LIFETIME and GRANIAN_WORKERS_MAX_RSS options as commented-out configuration in docker-compose.yml and run-granian.sh. These options provide a workaround for granian issue #286 where SSE connections are not properly closed after client disconnect, causing CPU spin loops after load tests complete. Refs: #2357, #2358 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docker-compose updates for GUNICORN Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docker-compose updates for GUNICORN Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docker-compose updates for GUNICORN Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Update pyproject.toml Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * lint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: add sso_entra_admin_groups to list field validator - sso_entra_admin_groups now properly parses CSV/JSON from environment - Closes #2265 Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com> * style: fix missing blank lines in test_config.py Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…scanners (#2200) - Convert cache.py to use async redis (redis.asyncio) for non-blocking I/O - Add parallel scanner execution using asyncio.gather in input/output filters - Add asyncio.to_thread for CPU-bound scanner operations - Quiet llm_guard logger to ERROR level to reduce noise - Fix tests to use prompt_id instead of deprecated name parameter - Update test to use environment variables for redis host/port Security: Scanner errors now fail-closed (is_valid=False) instead of being skipped, ensuring policy evaluation denies requests when scanners fail. Closes #1959 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
1. Replace custom CSS classes with native tailwind utility classes. 2. Add Chart.js theming for dark-mode graphs Signed-off-by: Gabriel Costa <gabrielcg@proton.me> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: plugin template test cases. Signed-off-by: Teryl Taylor <terylt@ibm.com> * fix: context passing in unit test Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com> --------- Signed-off-by: Teryl Taylor <terylt@ibm.com> Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com> Co-authored-by: Teryl Taylor <terylt@ibm.com> Co-authored-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Precompile all regex patterns at module or configuration initialization time across 14 plugins, eliminating per-request compilation overhead. Closes #1834 Signed-off-by: Shoumi <shoumimukherjee@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* update jwt cli with more inputs Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> * fix: prevent non-expiring tokens from invalid expires_in_days - Add ge=1 validation to TokenCreateRequest.expires_in_days schema - Add guard in _generate_token to reject expires_at in the past - Use math.ceil() and max(1, ...) to ensure exp is always set for sub-minute expirations (prevents rounding to 0) - Mark --secret and --algo CLI args as deprecated (always uses config) - Add tests for past expiry rejection and ceiling behavior This fixes a security regression where negative/zero expires_in_days could create permanent tokens instead of expired ones. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: restore --secret and --algo CLI options The --secret and --algo CLI parameters now work as optional overrides: - When provided, they override the configuration values - When not provided, JWT_SECRET_KEY and JWT_ALGORITHM from config are used This preserves backward compatibility while still defaulting to configuration-based signing for consistency. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: require --secret when --algo is specified Prevent invalid token generation by requiring --secret when --algo is provided. Using --algo alone would mix config-based keys with a different algorithm, potentially producing tokens that fail validation. Also fixes stale docstring that still referenced DEFAULT_SECRET/DEFAULT_ALGO instead of the new empty-string defaults with config fallback. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Madhav Kandukuri <madhav165@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Initial commit for filesystem server
- added stdio simple server
- implemented list_directory tool
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improved main runner
- Added argument handler
- improve tracing
- implemented streamable-http
Signed-off-by: cafalchio <maolivei@tcd.ie>
* added tracing info for each call
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Implemented search files recursively
- use glob patterns
- Added search to tools
- Handle errors
- Improve descriptions
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added files for new functions
- Updated cargo file for file search
Signed-off-by: cafalchio <maolivei@tcd.ie>
* implemented case insensitive search: Lowercase filenames and patterns
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Implemented read_file function and added to a server
- check for Max file size 1Mb
- check if the path is a file
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added tracing for read file, improved error description
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added get_file_info
- get size, created, modified and permissions
- Added to the server
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added Read multiple files
- use read file for each content
- read async
- added to server tools
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added write_file toold func
- Write to a tempfile uuid
- rename file to actual name
- remove tempfile
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added create_directory tool
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added create_directory to server and implemented placeholder for list_allowed_directories
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Changed release config to reduce bin size
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Implemented move file function.
- fails if destination exists
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added move_file to the server
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added edit file to the server
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added edit_file
- support dry_run
- use similar to get diffs
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Adding sandbox for path ccheck
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Apply fix to reduce TOCTOU vulnerability
- atomic write, no checks and write
Signed-off-by: cafalchio <maolivei@tcd.ie>
* improve read to reduce TOCTOU vulnerability
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Stopped to follow symlink on search (security)
Signed-off-by: cafalchio <maolivei@tcd.ie>
* removed unused import
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improved Sandbox for TOCTOU safety
- initialize sandbox once
- check if new folders are inside root
- resolve path inside root
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added get_roots for server list_allowed_directories
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Using sanbox resolve path before ger file info
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added sandox to write_file and create_directory
- validade parent folder
- clean tempfile after
- check new folder inside root
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added sandbox to write file
- canonicalize and check new folders
- check parent folders
- on create directory, check if exists
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added sandbox check for list_directory
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added sandbox check for edit_file and move_file
- check and canonicalize destination parent
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Apply sandbox checks for read_file and read_multiple_files
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Changed sandbox initialization from global to context
- removed global sandbox
- initialize sandbox in main and pass to each function
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added tests for searc_files and list_directories
- test for symlinks
- test for path outside roots
- > 95% coverage
- formatted using fmt
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Formatted files using cargo fmt
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added tests for get_file_info
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added test coverage for read_file and read_multiple_files
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added unit test coverage for write_file and create_directory
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improve test coverage for edit_file and move_file
Signed-off-by: cafalchio <maolivei@tcd.ie>
* format edit.rs
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added test coverage for sandbox
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improve server runner
- addded sever gracefully shutdown
- Declare Config values in main
- Improve server logs
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improved logs for list_allowed_directories and linted file
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improve test coverage for server
- iimproved logs in server
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Linted using cargo clippy
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Simplified main.rs and moved server code to lib.rs for integration tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added integration tests to simulate workflows
- file and folder manipulation workflow
- permission and metadata workflow
- search and organise workflow
- server tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* formatted and clippped integration tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* added result where it was missing in tool call result
- rename all outputs to result
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Normalized write file and create directory output
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Structured search and write outputs
- update tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Fixed write result not showing errors correctly
- server will return WriteResult
- create_directory return err or string
- fixed test create_directory tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Normalized output result for write_file
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Normalized tool result outpus
- Keep consistency between tools
- Return MPC error or success
- reorganised tools between files
- updated tests
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added server start banner
Signed-off-by: cafalchio <maolivei@tcd.ie>
* fixed main receiving multiple roots
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Clipped and formatted
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Removed justfile and added Makefile
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added correct readme
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Added dockerfile
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Improved tracing logs
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Update test coverage and binary size in readme
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Removed umused dependency
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Updated cargo dependencies
Signed-off-by: cafalchio <maolivei@tcd.ie>
* Updted deprecated InitializedRequestParam
Signed-off-by: cafalchio <maolivei@tcd.ie>
* fix: correct error messages and documentation in filesystem server
- Fix misleading error messages in server.rs that said "Error writing file"
when the actual operation was read, move, or get_file_info
- Update README to accurately reflect that move_file overwrites destination
(previously incorrectly stated it fails)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: cafalchio <maolivei@tcd.ie>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…ts (#2345) * Fix proxy authentication Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> * Fix pylint errors Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> * fix: Correct lint issues in proxy auth tests - Add missing blank line between test classes - Remove unused jwt import - Fix excess blank lines Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: Include plugin context in proxy auth for cross-hook sharing Add plugin_context_table and plugin_global_context to proxy authentication paths, matching the JWT authentication path. This ensures HTTP_AUTH_CHECK_PERMISSION hooks can access context set by HTTP_PRE_REQUEST hooks when using proxy authentication. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: Address security concerns in proxy authentication 1. RBAC now checks auth_required when proxy header missing - Returns 401 for API requests, 302 redirect for browsers - Aligns HTTP behavior with WebSocket auth 2. Block anonymous users from token management - Add auth_method=="anonymous" to _require_interactive_session - Prevents token access when proxy header missing 3. Lookup proxy user admin status from database - Check platform_admin_email for admin match - Query EmailUser table for is_admin status - Enables plugin permission hooks to work correctly Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: Align require_auth with RBAC proxy enforcement Update require_auth to check auth_required when proxy header is missing, matching the RBAC/WebSocket behavior. Previously returned "anonymous" even when auth_required=true. - Raise 401 when mcp_client_auth_enabled=false and no proxy header if auth_required=true - Update tests to cover both auth_required=true and false cases Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mohan Lakshmaiah <mohalaks@in.ibm.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
* fix: FastMCP compatibility
* fix: normalize issuer URL for metadata validation and caching
The original trailing slash fix introduced a bug where the issuer
validation would fail when the server returned an issuer without
trailing slash but the client passed one (or vice versa).
Changes:
- Normalize both the input issuer and metadata issuer for comparison
- Use normalized issuer as cache key for consistent cache lookup
- Add tests for trailing slash normalization scenarios
- Update test to expect refresh_token in grant_types
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: complete issuer normalization and conditional refresh_token
Address review feedback:
1. Normalize issuer consistently across the entire DCR flow:
- Allowlist validation uses normalized comparison
- Storage uses normalized issuer
- Lookup uses normalized issuer
2. Make refresh_token conditional on AS support:
- Check grant_types_supported in AS metadata
- Only request refresh_token if AS advertises support
3. Fix grant_types fallback:
- Use requested grant_types as fallback when AS response omits them
- Previously hardcoded to ["authorization_code"] which dropped refresh_token
4. Add comprehensive tests:
- Test refresh_token inclusion when AS supports it
- Test grant_types fallback behavior
- Test allowlist trailing slash normalization
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: handle null grant_types_supported and add issuer normalization migration
Address additional review findings:
1. Fix TypeError when grant_types_supported is explicit null:
- Use `metadata.get("grant_types_supported") or []` instead of
`metadata.get("grant_types_supported", [])`
- The latter returns None when key exists with null value
2. Add configurable permissive refresh_token mode:
- New setting: dcr_request_refresh_token_when_unsupported
- Default: False (strict mode - only request if AS advertises support)
- When True: request refresh_token if AS omits grant_types_supported
3. Add Alembic migration to normalize legacy issuer values:
- Strips trailing slashes from registered_oauth_clients.issuer
- Idempotent and works with SQLite and PostgreSQL
- Prevents duplicate registrations from legacy rows
4. Add comprehensive tests:
- Test explicit null grant_types_supported handling
- Test permissive refresh_token mode
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* docs: add DCR_REQUEST_REFRESH_TOKEN_WHEN_UNSUPPORTED to documentation
Update documentation for new DCR refresh token configuration option:
- README.md: Add to DCR settings table
- charts/mcp-stack/values.yaml: Add with comment
- charts/mcp-stack/README.md: Regenerated via helm-docs
- docs/docs/manage/dcr.md: Add env var and behavior note
- docs/docs/config.schema.json: Add schema definition
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: add ARM64 load testing support - Add build section for fast-time-server to support ARM64 architecture - Use pre-built ghcr.io image by default for x86_64 performance - ARM64 users can build locally via environment variable override - Fix Dockerfile to use TARGETARCH for proper cross-compilation Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Lint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…2314) Signed-off-by: Satya <tsp.0713@gmail.com>
…ched Templates (#2333) * Optimize SQLite JSON tag filtering with deterministic binds and cached templates Signed-off-by: Satya <tsp.0713@gmail.com> * feat: add tag filtering support to list resources template in main apis(non-template) Signed-off-by: Satya <tsp.0713@gmail.com> * removed unused fields - page, limit from list resource template from resource services Signed-off-by: Satya <tsp.0713@gmail.com> * fix: remove debug print statements from tool_service.py Remove debugging print statements that were accidentally left in the tag filtering code path. These were outputting query details to stdout which is not appropriate for production code. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Lint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: use column-specific bind prefixes to prevent parameter collision When multiple json_contains_tag_expr calls are combined in the same query (e.g., filtering on tags from different columns), the fixed bind names (:p0, :p1) would collide and overwrite parameters. This fix adds column-specific prefixes to bind parameter names (e.g., :tools_tags_p0, :resources_tags_p0) to ensure uniqueness when composing multiple tag filter predicates. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add coverage for json_contains_tag_expr and resource template filters Add comprehensive tests for: - _sanitize_col_prefix helper function - json_contains_tag_expr for SQLite with match_any and match_all - Bind parameter collision prevention when combining multiple tag filters - LRU caching of SQL templates - New list_resource_templates filtering parameters (tags, visibility, include_inactive) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: use thread-safe counter for fully unique bind prefixes Address edge cases where bind parameters could still collide: 1. Same column filtered multiple times in one query 2. Different column refs that sanitize to identical strings (e.g., "a_b.c" and "a.b_c" both become "a_b_c") Replace static column-based prefix with a thread-safe counter that generates truly unique prefixes per call (e.g., "tools_tags_42_p0"). This removes the LRU caching of templates since each call now has a unique prefix, but ensures correctness in all edge cases. Add test for same-column collision scenario. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Satya <tsp.0713@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* optimize response_cache_by_prompt lookup with inverted index Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * fix type hint Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * flake8 fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test: add unit tests for response_cache_by_prompt inverted index Add comprehensive test coverage for the inverted index optimization: - Tokenization and vectorization functions - Basic cache store and hit functionality - Inverted index population and candidate filtering - Eviction and index rebuild scenarios - Max entries cap with index consistency Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Shoumi <shoumimukherjee@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: Add Gateway permission constants Add GATEWAYS_CREATE, GATEWAYS_READ, GATEWAYS_UPDATE, and GATEWAYS_DELETE permission constants to the Permissions class for consistency with other resource types (tools, resources, prompts, servers). Note: The original PR #2186 attempted to fix issue #2185 by modifying the visibility query logic, but that change was incorrect. The team filter should only show resources BELONGING to the filtered team, not all public resources globally. See todo/rbac.md for documentation. Issue #2185 needs further investigation - the reported bug may have a different root cause. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat: Add gateway permission patterns to token scoping middleware Add gateway routes to token scoping middleware for consistent permission enforcement: - Add gateway pattern to _RESOURCE_PATTERNS for ID extraction - Add gateway CRUD patterns to _PERMISSION_PATTERNS: - POST /gateways (exact) -> gateways.create - POST /gateways/{id}/... (sub-resources) -> gateways.update - PUT/DELETE -> gateways.update/delete - Add gateway handling in _check_resource_team_ownership: - Public: accessible by all - Team: accessible by team members - Private: owner-only access (per RBAC doc) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: Enforce owner-only access for private visibility across all resources Per RBAC doc, private visibility means "owner only" - not "team members". Fixed private visibility checks for all resource types to validate owner_email == requester instead of team membership: - Servers - Tools - Resources - Prompts - Gateways (already correct from previous commit) This aligns token scoping middleware with the documented RBAC model. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: Add tests for gateway permissions and visibility RBAC Add unit tests covering: - Gateway permission patterns (POST create vs POST update sub-resources) - Private visibility enforces owner-only access - Team visibility allows team members only - Public visibility allows all authenticated users These tests validate the RBAC fixes in token scoping middleware. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat-2187: add additional default roles while bootstrap
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fix lint issues
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fixing review comments
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: fixing review comments
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* feat-2187: test fix
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
* fix: Improve bootstrap roles validation and documentation
Fixes identified by code review:
1. Path resolution: Fixed parent.parent.parent -> parent.parent to correctly
resolve project root from mcpgateway/bootstrap_db.py
2. JSON validation: Added validation that loaded JSON is a list of dicts with
required keys (name, scope, permissions). Invalid entries are skipped with
warnings instead of crashing bootstrap.
3. Improved logging: Log all attempted paths when file not found
Added tests:
- test_bootstrap_roles_with_dict_instead_of_list: Validates error when JSON is
a dict instead of array
- test_bootstrap_roles_with_missing_required_keys: Validates warning when roles
are missing required fields
Added documentation:
- docs/docs/manage/rbac.md: New "Bootstrap Custom Roles" section with
configuration examples for Docker Compose and Kubernetes
- docs/docs/architecture/adr/036-bootstrap-custom-roles.md: ADR documenting
the feature design, error handling, and security considerations
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: Make description and is_system_role optional for bootstrap roles
ChatGPT review identified that description and is_system_role were accessed
unconditionally via role_def["key"], causing KeyError for minimal roles.
Fix:
- Use role_def.get("description", "") with empty string default
- Use role_def.get("is_system_role", False) with False default
Added test:
- test_bootstrap_roles_with_minimal_valid_role: Verifies a role with only
required fields (name, scope, permissions) is created successfully with
correct defaults for optional fields
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
---------
Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Nithin Katta <Nithin.Katta@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…y blockers (#2394) * Remove last 2 security issues from Sonarqube Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> * Remove 5 of 8 blocker maintainability issues Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> * Correct linting errors Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com> --------- Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
…ad (#2157) * perf(crypto): offload Argon2/Fernet to threadpool via asyncio.to_thread Add async wrappers (hash_password_async, verify_password_async, encrypt_secret_async, decrypt_secret_async) and update all call sites to use them, preventing event loop blocking during CPU-intensive crypto operations. Closes #1836 Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com> * fix(tests): update tests for async crypto operations Update test mocks to use async versions of password service and encryption service methods (hash_password_async, verify_password_async, encrypt_secret_async) following the changes in the crypto offload PR. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sso): add missing await for async create/update provider methods The crypto offload PR made SSOService.create_provider() and update_provider() async, but forgot to update call sites: - mcpgateway/routers/sso.py: add await in admin endpoints - mcpgateway/utils/sso_bootstrap.py: convert to async, add awaits - mcpgateway/main.py: make attempt_to_bootstrap_sso_providers async Without this fix, the router endpoints would return coroutine objects instead of provider objects, causing runtime errors (500) when accessing provider.id. The bootstrap would silently skip provider creation with "coroutine was never awaited" warnings. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(crypto): add tests for async crypto wrappers and SSO bootstrap Add test coverage for the async crypto operations introduced by the crypto offload PR: - test_async_crypto_wrappers.py: Tests for hash_password_async, verify_password_async, encrypt_secret_async, decrypt_secret_async including roundtrip verification and sync/async compatibility - test_sso_bootstrap.py: Tests for async SSO bootstrap ensuring create_provider and update_provider are properly awaited Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* chore-2193: add Rocky Linux setup script Add setup script for Rocky Linux and RHEL-compatible distributions. Adapts the Ubuntu setup script with the following changes: - Use dnf package manager instead of apt - Docker CE installation via RHEL repository - OS detection for Rocky, RHEL, CentOS, and AlmaLinux - Support for x86_64 and aarch64 architectures Closes #2193 Signed-off-by: Jonathan Springer <jps@s390x.com> * chore-2193: add Docker login check before compose-up Check if Docker is logged in before running docker-compose to avoid image pull failures. If not logged in, prompt user with options: - Interactive login (username/password prompts) - Username with password from stdin (for automation) - Skip login (continue without authentication) Supports custom registry URLs for non-Docker Hub registries. Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: add non-interactive mode and git repo check to setup scripts Apply to both Rocky and Ubuntu setup scripts: - Add -y/--yes flag for fully non-interactive operation - Check for .git directory before running git pull - Fail fast with clear error if directory exists but isn't a git repo - Auto-confirm prompts in non-interactive mode - Exit with error on unsupported OS in non-interactive mode Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker. Changes: - Add _respond_tasks dict to track respond tasks by session_id - Cancel respond tasks explicitly before session cleanup in remove_session() - Cancel all respond tasks during shutdown() - Pass disconnect callback to SSE transport for defensive cleanup - Convert database backend from fire-and-forget to structured concurrency The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect, and awaited to completion, preventing orphaned tasks from spinning the event loop. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: additional fixes for CPU spin loop after SSE disconnect Follow-up fixes based on testing and review: 1. Cancellation timeout escalation (Finding 1): - _cancel_respond_task() now escalates on timeout by calling transport.disconnect() - Retries cancellation after escalation - Always removes task from tracking to prevent buildup 2. Redis respond loop exit path (Finding 2): - Changed from infinite pubsub.listen() to timeout-based get_message() polling - Added session existence check - loop exits if session removed - Allows loop to exit even without cancellation 3. Generator finally block cleanup (Finding 3): - Added on_disconnect_callback() in event_generator() finally block - Covers: CancelledError, GeneratorExit, exceptions, and normal completion - Idempotent - safe if callback already ran from on_client_close 4. Added load-test-spin-detector make target: - Spike/drop pattern to stress test session cleanup - Docker stats monitoring at each phase - Color-coded output with pass/fail indicators - Log file output to /tmp Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: fix race condition in sse_endpoint and add stuck task tracking Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task was created AFTER create_sse_response(). If client disconnected during response setup, the disconnect callback ran before the task existed, leaving it orphaned. Now matches utility_sse_endpoint ordering: 1. Compute user_with_token 2. Create and register respond task 3. Call create_sse_response() Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't be cancelled after escalation. Previously these were dropped from tracking entirely, losing visibility. Now they're moved to _stuck_tasks for monitoring and final cleanup during shutdown(). Updated tests to verify escalation behavior. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add SSE failure cleanup, stuck task reaper, and full load test Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response() fails. Added try/except around create_sse_response() in both sse_endpoint and utility_sse_endpoint - on failure, calls remove_session() to clean up the task and session before re-raising. Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to: - Remove completed tasks from _stuck_tasks - Retry cancellation for still-stuck tasks - Prevent memory leaks from tasks that eventually complete Finding 3 (LOW): Added test for escalation path with fake transport to verify transport.disconnect() is called during escalation. Also added tests for the stuck task reaper lifecycle. Also updated load-test-spin-detector to be a full-featured test matching load-test-ui with JWT auth, all user classes, entity ID fetching, and the same 4000-user baseline. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: improve load-test-spin-detector output and reduce cycle sizes - Reduce logging level to WARNING to suppress noisy worker messages - Only run entity fetching and cleanup on master/standalone nodes - Reduce cycle sizes from 4000 to 1000 peak users for faster iteration - Update banner to reflect new cycle pattern (500 -> 750 -> 1000) - Remove verbose JWT token generation log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: address remaining CPU spin loop findings Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE endpoints. In Python 3.8+, CancelledError inherits from BaseException, not Exception, so the previous except block wouldn't catch it. Now cleanup runs even when requests are cancelled during SSE handshake. Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None to prevent tight loop. The loop now has guaranteed minimum sleep even when Redis returns immediately in certain states. Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops to exit early. remove_session() now marks the session as closing BEFORE attempting task cancellation, so the respond loop (Redis and DB backends) can exit immediately without waiting for the full cancellation timeout. Finding 4 (LOW): Already addressed in previous commit with test test_cancel_respond_task_escalation_calls_transport_disconnect. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: make load-test-spin-detector run unlimited cycles - Cycles now repeat indefinitely instead of stopping after 5 - Fixed log file path to /tmp/spin_detector.log for easy monitoring - Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts - Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)") - Banner shows monitoring command: tail -f /tmp/spin_detector.log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: remove redundant asyncio.CancelledError handlers CancelledError inherits from BaseException in Python 3.8+, so it won't be caught by 'except Exception' handlers. The explicit handlers were unnecessary and triggered pylint W0706 (try-except-raise). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add sleep on non-message Redis pubsub types to prevent spin Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(pubsub): replace blocking listen() with timeout-based get_message() The blocking `async for message in pubsub.listen()` pattern doesn't respond to asyncio cancellation properly. When anyio's cancel scope tries to cancel tasks using this pattern, the tasks don't respond because the async iterator is blocked waiting for Redis messages. This causes anyio's `_deliver_cancellation` to continuously reschedule itself with `call_soon()`, creating a CPU spin loop that consumes 100% CPU per affected worker. Changed to timeout-based polling pattern: - Use `get_message(timeout=1.0)` with `asyncio.wait_for()` - Loop allows cancellation check every ~1 second - Added sleep on None/non-message responses to prevent edge case spins Files fixed: - mcpgateway/services/cancellation_service.py - mcpgateway/services/event_service.py Closes #2360 (partial - additional spin sources may exist) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops The MCP session/transport __aexit__ methods can block indefinitely when internal tasks don't respond to cancellation. This causes anyio's _deliver_cancellation to spin in a tight loop, consuming ~800% CPU. Root cause: When calling session.__aexit__() or transport.__aexit__(), they attempt to cancel internal tasks (like post_writer waiting on memory streams). If these tasks don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop. Changes: - Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py - Wrap all __aexit__ calls in asyncio.wait_for() with timeout - Add timeout to pubsub cleanup in session_registry.py and registry_cache.py - Add timeout to streamable HTTP context cleanup in translate.py This is a continuation of the fix for issue #2360. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(config): make session cleanup timeout configurable Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to control how long cleanup operations wait for session/transport __aexit__ calls to complete. Clarification: This timeout does NOT affect tool execution time (which uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions to prevent CPU spin loops when internal tasks don't respond to cancel. Changes: - Add mcp_session_pool_cleanup_timeout to config.py - Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs - Add to charts/mcp-stack/values.yaml - Update mcp_session_pool.py to use _get_cleanup_timeout() helper - Update session_registry.py and registry_cache.py to use config - Update translate.py to use config with fallback When to adjust: - Increase if you see frequent "cleanup timed out" warnings in logs - Decrease for faster shutdown (at risk of resource leaks) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add deadline to cancel scope to prevent CPU spin loop Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at 100% CPU when SSE task group tasks don't respond to cancellation. Root cause: When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps rescheduling itself in a tight loop. Fix: Override EventSourceResponse.__call__ to set a deadline on the cancel scope when cancellation starts. This ensures that if tasks don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds), the scope times out instead of spinning indefinitely. References: - agronholm/anyio#695 - anthropics/claude-agent-sdk-python#378 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(translate): use patched EventSourceResponse to prevent CPU spin translate.py was importing EventSourceResponse directly from sse_starlette, bypassing the patched version in sse_transport.py that prevents the anyio _deliver_cancellation CPU spin loop (anyio#695). This change ensures all SSE connections in the translate module (stdio-to-SSE bridge) also benefit from the cancel scope deadline fix. Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(cleanup): reduce cleanup timeouts from 5s to 0.5s With many concurrent connections (691 TCP sockets observed), each cancelled SSE task group spinning for up to 5 seconds caused sustained high CPU usage. Reducing the timeout to 0.5s minimizes CPU waste during spin loops while still allowing normal cleanup to complete. The cleanup timeout only affects cleanup of cancelled/released connections, not normal operation or tool execution time. Changes: - SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds - mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds - Updated .env.example and charts/mcp-stack/values.yaml Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor(cleanup): make SSE cleanup timeout configurable with safe defaults - Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s) - Make sse_transport.py read timeout from config via lazy loader - Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default - Override both to 0.5s in docker-compose.yml for testing The 5.0s default is safe for production. The 0.5s override in docker-compose.yml allows testing aggressive cleanup to verify it doesn't affect normal operation. Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(gunicorn): reduce max_requests to recycle stuck workers The MCP SDK's internal anyio task groups don't respond to cancellation properly, causing CPU spin loops in _deliver_cancellation. This spin happens inside the MCP SDK (streamablehttp_client, sse_client) which we cannot patch. Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are recycled frequently, cleaning up any accumulated stuck task groups. Root cause chain observed: 1. PostgreSQL idle transaction timeout 2. Gateway state change failures 3. SSE connections terminated 4. MCP SDK task groups spin (anyio#695) This is a workaround until the MCP SDK properly handles cancellation. Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(anyio): monkey-patch _deliver_cancellation to prevent CPU spin Root cause: anyio's _deliver_cancellation has no iteration limit. When tasks don't respond to CancelledError, it schedules call_soon() callbacks indefinitely, causing 100% CPU spin (anyio#695). Solution: - Monkey-patch CancelScope._deliver_cancellation to track iterations - Give up after 100 iterations and log warning - Clear _cancel_handle to stop further call_soon() callbacks Also switched from asyncio.wait_for() to anyio.move_on_after() for MCP session cleanup, which better propagates cancellation through anyio's cancel scope system. Trade-off: If cancellation gives up after 100 iterations, some tasks may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000 worker recycling will eventually clean up orphaned tasks. Closes #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * refactor(anyio): make _deliver_cancellation patch optional and disabled by default The anyio monkey-patch is now feature-flagged and disabled by default: - ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default) - ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100 This allows testing performance with and without the patch, and easy rollback if upstream anyio/MCP SDK fixes the issue. Added: - Config settings for enabling/disabling the patch - apply_anyio_cancel_delivery_patch() function for explicit control - remove_anyio_cancel_delivery_patch() to restore original behavior - Documentation in .env.example and docker-compose.yml To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add comprehensive CPU spin loop mitigation documentation (#2360) Add multi-layered documentation for CPU spin loop mitigation settings across all configuration files. This ensures operators understand and can tune the workarounds for anyio#695. Changes: - .env.example: Add Layer 1/2/3 headers with cross-references to docs and issue #2360, document all 6 mitigation variables - README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers, configuration tables, and tuning tips - docker-compose.yml: Consolidate all mitigation variables into one section with SSE protection (Layer 1), cleanup timeouts (Layer 2), and experimental anyio patch (Layer 3) - charts/mcp-stack/values.yaml: Add comprehensive mitigation section with layer documentation and cross-references - docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide with root cause analysis, 4-layer defense diagram, configuration tables, diagnostic commands, and tuning recommendations - docs/docs/.pages: Add Operations section to navigation - docs/docs/operations/.pages: Add nav for operations docs Mitigation variables documented: - Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX - Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT - Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS Related: #2360, anyio#695, claude-agent-sdk#378 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(loadtest): aggressive spin detector with configurable timings Update spin detector load test for faster issue reproduction: - Increase user counts: 4000 → 4000 → 10000 pattern - Fast spawn rate: 1000 users/s - Shorter wait times: 0.01-0.1s between requests - Reduced connection timeouts: 5s (fail fast) Related: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * compose mitigation Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * load test Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Defaults Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Defaults Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * docs: add docstring to cancel_on_finish for interrogate coverage Add docstring to nested cancel_on_finish function in EventSourceResponse.__call__ to achieve 100% interrogate coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: replace copier with cookiecutter for template scaffolding (#2361) Copier pulled jinja2-ansible-filters (GPL-licensed). Cookiecutter is a simpler, widely-adopted alternative with no problematic transitive deps. - Migrate all 4 template sets (Go server, Python server, native plugin, external plugin) from copier YAML configs to cookiecutter.json with {{cookiecutter.*}} directory conventions - Update mcpplugins CLI to use cookiecutter.main.cookiecutter() API, add --template_type flag to select native/external plugin templates - Replace copier>=9.11.3 with cookiecutter>=2.6.0 in pyproject.toml and tox.ini - Update scaffold shell scripts to invoke cookiecutter CLI - Update tests, AGENTS.md, llms docs, and roadmap Closes #2361 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: exclude cookiecutter template dirs from pre-commit linting Template files containing Jinja2 syntax ({{ cookiecutter.* }}, {% set %}) are invalid YAML/TOML/Python from linters' perspective. Add plugin_templates/ to the global exclude in both pre-commit configs and yamllint. Also restore executable bit on run-server.sh and update uv.lock. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: exclude cookiecutter templates from gosec and tomllint Template go.mod and pyproject.toml contain Jinja2 variables that are invalid from the perspective of Go tooling and TOML parsers. Skip template directories in gosec, govulncheck, and tomllint find commands. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: exclude cookiecutter templates from tomllint CI workflow The lint.yml workflow runs tomlcheck directly via find, not through the Makefile target. Add the same template directory exclusions. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(ui): unified admin search with tags + global search Implement a unified, consistent search experience across the Admin UI: - Standardize search endpoints with unified response shape (items, count, entity_type, query, filters_applied) while preserving legacy keys - Add tag filtering via comma=OR and plus=AND semantics with bounds (max 20 groups, 10 terms per group) - Server-side panel search for all entity panels via HTMX partial reloads with debounced input and namespaced URL params for shareability - Global search modal (Ctrl/Cmd+K) aggregating results across servers, gateways, tools, resources, prompts, agents, teams, and users - Escape SQL LIKE wildcards in all search filters to prevent injection - Consistent search field coverage between search and partial endpoints Closes #2076 Closes #2109 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(metrics): prevent duplicate Prometheus metric registration in multi-app tests Guard setup_metrics() against re-registration errors when multiple FastAPI apps are instantiated in the same process (e.g., test suites). Uses best-effort registry lookup via _get_registry_collector() to reuse existing collectors instead of crashing on ValueError. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(admin): eliminate redundant get_user_teams calls in unified search Add _get_user_team_ids helper with caching support. The unified search endpoint pre-fetches team IDs once and injects them into the user context, avoiding 6 redundant database lookups (one per entity type). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(admin): use ESCAPE clause for LIKE queries and update Playwright search waits SQLAlchemy .contains() does not add an ESCAPE clause, so escaped wildcards (\_) are treated as literal two-character sequences on SQLite. Replace all .contains(_escape_like(...)) calls with _like_contains() which generates .like(..., escape='\\') for correct behaviour on all database backends. Update Playwright page objects and tests to wait for the HTMX partial responses now that search is server-side instead of client-side. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * format Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(tests): use networkidle wait for HTMX search in Playwright tests Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(admin): format replaceChild call for Prettier compliance Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Code review Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Code review Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Code review Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Test networkwait fix Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting and permissions updates Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting and permissions updates Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting and test updates + fix UI logout issue Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * flake8 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(ui): flexible admin UI sections for embedded contexts Add env controls to hide UI sections/header items (including embedded mode defaults). Support per-request ?ui_hide= overrides persisted via cookie and prevent hidden sections from loading data. Update Helm values/schema, docker-compose, docs, and add unit + JS tests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(ui): restore default tab fallback Prefer the Overview tab when available and fall back to Gateways for minimal DOM states. Align JS tab tests with the restored default tab behavior. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(ui): review fixes for admin UI section visibility Deduplicate UI_HIDABLE_SECTIONS, UI_HIDABLE_HEADER_ITEMS, and UI_HIDE_SECTION_ALIASES constants — define once in config.py, import in admin.py. Harden normalizeTabName() against CSS selector injection by restricting to [a-z0-9-] characters. Add max_age (30 days) to the ui_hide_sections cookie so preferences persist across browser sessions. Filter global search results to exclude hidden sections. Expand JS test coverage: normalizeTabName edge cases, admin-only tab blocking, idempotency guard, getDefaultTabName priority, isTabAvailable, getUiHiddenSections/Tabs normalization, and global search filtering. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(ui): add comprehensive test coverage for admin UI section visibility Cover _normalize_ui_hide_values, cookie set/delete behavior, section-aware data loading, config validators, isTabHidden, and resolveTabForNavigation. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(config): use NoDecode class instead of instance for env parsing NoDecode() (instance) in Annotated metadata is not found by pydantic-settings' `NoDecode in field.metadata` check, causing json.loads to be called on raw CSV strings. Use NoDecode (class) which matches correctly. Also adds ADR-040 and admin UI customization guide. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fixes Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(ui): expand admin section hiding coverage Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test(metrics): stabilize db engine detection setup test Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(playwright): stabilize admin ui tab and search flows Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Fixes Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- S7502: store asyncio task references to prevent GC (resource_cache, token_catalog_service, email_auth_service) - S5655: replace SimpleNamespace with SSOProviderContext dataclass (sso_service) - S5886: fix return type lies with Optional (gateway_service, services_auth) - S5890: add missing Optional for None defaults (cedar/opa schemas) - S3923: remove dead if/else branches (cache, translate) - S3457: fix logger format string (validate_signature) - S1871: merge duplicate branches (toon, schemas) - S1854: remove dead stores for current_admin_origin (sso_service) - S5247: remove 16 autoescape-false blocks, replace 2 |safe patterns with CSS word-spacing (admin.html, 11 partial templates) Closes #2981 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: update API token last_used and log usage stats Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * flake8 fix Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * additional changes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * remove duplicate last_used update in token usage logging Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * additional test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * added exception handling Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * fix coverage Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * fix: optimize token tracking with rate-limiting and raw ASGI middleware Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * additional test fixes Signed-off-by: Shoumi <shoumimukherjee@gmail.com> * fix: address review findings in token usage tracking - Store JTI in request.state.jti for standard API tokens in _set_auth_method_from_payload, ensuring middleware can access it without re-decoding the JWT on cached/batched auth paths - Remove redundant db.commit() in TokenUsageMiddleware since log_token_usage() already commits the transaction - Add bounded eviction (max 1024 entries) to in-memory rate-limit cache in _update_api_token_last_used_sync to prevent unbounded growth - Add test coverage for cache eviction and JTI propagation - Fix formatting (missing space, unused import, black/isort) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: patch hidden sections env leakage in admin UI tests Two admin tests failed because .env sets MCPGATEWAY_UI_HIDE_SECTIONS which leaks into test execution, causing section-dependent code paths (resource loading, gRPC services) to be skipped. Patch settings to ensure no sections are hidden during these tests. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: resolve DI, retry storm, race, and opaque token tracking issues - Fix get_current_user request param type (Optional[object] → Request) so FastAPI auto-injects it, enabling last_used and usage logging - Add 30s backoff in _get_sync_redis_client after connection failures to prevent retry storms when Redis is down - Move in-memory cache from hasattr-based function attributes to module-level globals, eliminating initialization race condition - Store user_email in request state for DB-fallback opaque tokens so usage middleware can log them without JWT decode Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Shoumi <shoumimukherjee@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…authorization code (#2899) * fix: enhance OAuth callback handling for error responses and missing authorization code Signed-off-by: Ankit Gajera <ankit.gajera@merckgroup.com> * fix: sanitize root path in OAuth callback responses and remove unnecessary resource checks Signed-off-by: Ankit Gajera <ankit.gajera@merckgroup.com> * feat(tests): add Microsoft Entra ID E2E integration tests Add comprehensive end-to-end tests for Microsoft Entra ID SSO integration that validate group-based admin role assignment against real Azure infrastructure. Tests included (15 total): - Admin role assignment based on Entra ID group membership - Non-admin user handling - Dynamic role promotion on re-login - Case-insensitive group ID matching - Real ROPC token acquisition and validation - Multiple admin groups configuration - Admin role retention behavior (by design - see #2331) - Token validation (expired, invalid audience/issuer) - Sync disabled behavior The tests are fully self-contained: they create test users and groups in Azure AD before tests and clean them up afterward. Also includes comprehensive documentation in docs/docs/testing/entra-id-e2e.md covering setup, configuration, and test scenarios. Relates to: #2331 Signed-off-by: Jonathan Springer <jps@s390x.com> * Linting fixes Signed-off-by: Jonathan Springer <jps@s390x.com> * fix(tests): prevent Entra ID e2e test from poisoning RBAC decorators under xdist Module-level noop patching of RBAC decorators in test_entra_id_integration.py polluted sys.modules when sharing xdist workers with unit tests, causing 45 failures (missing __wrapped__ and decorators not raising HTTPException). Replace module-level noop patching with a fixture-scoped PermissionService mock and add missing default_admin_role to all mock_settings blocks to prevent MagicMock objects from reaching SQL queries. Signed-off-by: Jon Spriggs <github@sprig.gs> Signed-off-by: Jonathan Springer <jps@s390x.com> * fix: harden Entra v2 host matching and escape gateway_id in OAuth callback JS Use explicit frozenset of known Entra hosts (global + sovereign clouds) instead of startswith to prevent lookalike domain matching. Escape gateway_id in the JS fetch URL. Add sovereign cloud and negative tests. Signed-off-by: Jonathan Springer <jps@s390x.com> --------- Signed-off-by: Ankit Gajera <ankit.gajera@merckgroup.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: Jon Spriggs <github@sprig.gs> Co-authored-by: Jonathan Springer <jps@s390x.com>
…gger session (#3001) - Add db.rollback() to exception handlers in gateway_service.py to prevent orphaned gateway records when registration fails after flush - Isolate structured logger DB persistence using fresh_db_session() to avoid accidentally committing caller's dirty session state - Propagate flush exceptions in before_commit_handler instead of swallowing them so commit failures trigger proper rollback - Remove db parameter from structured_logger.log() signature since the logger now manages its own independent session - Update tests to verify new session isolation and exception propagation Closes #2987 Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
… appropriate list tools response (#2974) * maintain server_id context with session affinity Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> * test cases for new improvements Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> * linting Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> * fix: address review findings for stateful session context propagation - Remove debug log statement that leaked auth headers and user context at INFO level (security: token/PII exposure in production logs) - Apply _get_request_context_or_default() to call_tool, read_resource, and list_resource_templates for consistent context recovery in stateful sessions (previously only list_* handlers were updated) - Fix server_id injection to create params dict when absent, ensuring JSON-RPC requests without params still get proper server scoping - Remove orphaned blank line in list_resources left from prior edit - Clean up test imports: remove duplicate pytest, unused asyncio/types/ transport imports, fix isort ordering - Fix test property cleanup in test_list_resources_direct_proxy_meta_ lookup_error to properly save/restore original descriptor - Run black + isort on all changed files Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add missing coverage for context propagation and affinity injection Cover all branches in _get_request_context_or_default(): - Fast path when ContextVars are already populated - Anonymous user string-to-dict conversion - URL without /servers/{id}/mcp pattern (no server_id match) - LookupError fallback (no active request context) - Generic Exception fallback with error logging - Cookie JWT token forwarding to require_auth_override Cover server_id injection edge cases: - JSON-RPC body without params key (params dict created) - URL without server pattern (no injection occurs) Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: normalize raw JWT payload in stateful session context recovery _get_request_context_or_default() was returning raw JWT payloads from require_auth_override() without normalizing to the canonical user context shape {email, teams, is_admin, is_authenticated} that MCP handlers expect. This caused handlers to look up missing keys (email, teams) from the raw JWT which uses different field names (sub, token_use, nested user.is_admin). Changes: - Add _normalize_jwt_payload() that mirrors streamable_http_auth normalization: extracts email from sub/email, resolves teams via normalize_token_teams (API) or _resolve_teams_from_db_sync (session), and detects nested user.is_admin - call_tool now extracts app_user_email from the already-recovered user_context instead of reading from stale ContextVar via get_user_email_from_context() - Update test mocks to use realistic JWT payloads instead of pre-normalized dicts - Add 6 dedicated tests for _normalize_jwt_payload covering API token, session token (admin/non-admin), nested is_admin, email fallback, and no-email edge case Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * test: add regression test for call_tool recovered email propagation Verify call_tool passes app_user_email from the recovered fallback context (via _get_request_context_or_default) rather than reading from the stale user_context_var ContextVar. This prevents regression of the stateful session identity propagation fix. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: add missing docstring params for _normalize_jwt_payload (flake8 DAR) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Update security pre-commit tools Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Update all dependencies Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: resolve Containerfile build failures found during full validation - csv_pandas_chat: remove COPY templates/ (directory doesn't exist) - data_analysis/mcp_eval: replace shell heredoc with printf (Docker parses heredoc content as Dockerfile instructions); remove COPY of non-existent __init__.py and config/ directory - latex_server: switch to Fedora 41 base (texlive packages not in UBI10 repos) - libreoffice_server: switch to Fedora 41 base (libreoffice not in UBI10 repos) - python_sandbox: remove coreutils from builder/runtime (conflicts with coreutils-single pre-installed in UBI images) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix: make xdg-open non-fatal in Rust coverage target The coverage Makefile target calls xdg-open to display the HTML report, which fails in CI where no browser is available. Make it non-fatal so the coverage report generation still succeeds. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ing (#2958) The config validator now converts empty/whitespace-only X_FRAME_OPTIONS values to None, matching the middleware's expectation that None means "allow all embedding". Previously, an empty string fell through to the default DENY behavior in the middleware, blocking embedding unexpectedly. Closes #2492 Signed-off-by: prakhar-singh1928 <prakhar.singh1928@ibm.com> Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…der (#3026) * fix(auth): add jwks_uri column to SSOProvider and harden create_provider Add jwks_uri as a first-class column on SSOProvider for standard OIDC JWKS endpoint support. Make create_provider defensive by filtering unknown keys to prevent TypeError crashes during SSO bootstrap. Closes #3010 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(auth): include jwks_uri in bootstrap_db schema check and document setting Update _schema_looks_current() to check for sso_providers.jwks_uri, preventing unversioned databases from being stamped at head without the new column. Add SSO_GENERIC_JWKS_URI to .env.example. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * lint Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* update roadmap and changelog Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * update links in docs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ion for observability service Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
f49ffcc to
df896cf
Compare
- Eliminate double plugin execution on failure by restructuring _execute_with_timeout to separate observability from plugin execution - Close orphaned spans with status="error" when plugin hooks raise - Fix docstring typo in ObservabilityMiddleware - Move ObservabilityServiceAdapter import inside conditional block - Document ContextVar bridge requirement on both definitions Signed-off-by: Jonathan Springer <jps@s390x.com>
df896cf to
19247b3
Compare
jonpspri
left a comment
There was a problem hiding this comment.
This refactor cleanly decouples the plugin framework from gateway-internal observability by introducing an ObservabilityProvider protocol and dependency-injecting it through PluginManager into PluginExecutor, replacing the previous tight coupling where _execute_with_timeout directly imported and instantiated ObservabilityService and SessionLocal. The new ObservabilityServiceAdapter encapsulates DB session lifecycle management on the gateway side, while the framework side operates purely against the protocol interface — a sound architectural boundary that makes the plugin framework independently testable and reusable. The ContextVar bridge between the middleware and framework is well-documented, the NullObservability no-op provides a clean default, and the test coverage is thorough across happy paths, error/rollback scenarios, Borg singleton late-binding, and the context variable bridge. The wiring in main.py is minimal and straightforward. LGTM.
Cover error and edge-case paths in plugin executor observability spans and get_plugin_manager initialization with plugins enabled. Signed-off-by: Jonathan Springer <jps@s390x.com>
jonpspri
left a comment
There was a problem hiding this comment.
See review comments above.
|
Thanks for the thorough fixes, @araujof. Both critical issues from my previous review are resolved:
No new issues found. The decoupling goal is cleanly achieved — the plugin framework no longer imports gateway internals. LGTM. |
|
Reopened as #3140. CI/CD will re-run on the new PR. You are still credited as the author. |
Description
Removes the plugin framework's direct dependency on
mcpgateway.db.SessionLocalandmcpgateway.services.observability_serviceby introducing a protocol-based observability abstraction with dependency injection.Closes: #2828
Problem
The plugin framework's
_execute_with_timeoutmethod importedSessionLocalandObservabilityServicedirectly from the gateway, creating a tight coupling that prevented the framework from operating standalone. This also required the plugin framework to manage database sessions for tracing — a concern that belongs to the host application.Solution
ObservabilityProviderprotocol andNullObservabilityno-op default in a newobservability.pymodulecurrent_trace_idContextVar owned by the plugin framework (replacing the one imported from the gateway's observability service)PluginExecutorandPluginManagernow accept an optionalobservabilityparameter via constructor injectionget_plugin_manager()forwards the observability provider to the managerDependency direction (before → after)
Changes
mcpgateway/plugins/framework/observability.pyObservabilityProviderprotocol,NullObservabilitydefault,current_trace_idContextVarmcpgateway/plugins/framework/manager.pyPluginExecutorandPluginManageracceptobservabilityparam;_execute_with_timeoutuses injected provider instead of direct gateway importsmcpgateway/plugins/framework/__init__.pyget_plugin_manager()acceptsobservabilityparam;ObservabilityProvideradded to__all__tests/unit/.../test_observability.pytests/unit/.../test_manager_coverage.pyTestExecuteWithTimeoutto test against injected provider instead of mockingmcpgateway.db/mcpgateway.servicesAdditional Changes
A few tests in
tests/unit/../test_http_auth_integration.pywere failing on dev environment when running the coverage target. Solution: Reset Starlette's cached middleware_stack after injecting the mock plugin manager so the middleware chain is rebuilt with the mock, and restore it afterward to avoid side effects on other tests.Test plan