feat(security): implement policy engine phase 1 #2682

Closed
yiannis2804 wants to merge 2237 commits into IBM:main from yiannis2804:feature/issue-2019-policy-engine-phase1

Conversation

@yiannis2804
Contributor

✨ Feature / Enhancement PR

🔗 Epic / Issue

Link to the epic or parent issue:
Closes #2019 (Phase 1)

Submitted by: sweng-group-5
Note: Phases 2, 3, and 4 will be completed by other team members in subsequent PRs.


🚀 Summary (1-2 sentences)

Implements Phase 1 of the centralized RBAC/ABAC Policy Engine: creates PolicyEngine service with unified authorization, adds 4 database tables for policy management, and migrates all 250 @require_permission decorators to use the new system.


🧪 Checks

  • make lint passes
  • make test passes
  • make verify passes (10/10 Mascarpone)
  • 16 unit tests with 89% coverage on policy_engine.py
  • CHANGELOG updated (if user-facing)

📓 Notes

Team Context

This PR is Phase 1 of 4 for issue #2019, completed by sweng-group-5. The remaining phases are assigned to other team members:

  • Phase 1 (this PR): Foundation - PolicyEngine service, database models, decorator migration ✅
  • Phase 2: ABAC support with policy conditions 🔜
  • Phase 3: Admin UI for policy management 🔜
  • Phase 4: Advanced features (versioning, external engines) 🔜

What Was Accomplished

PolicyEngine Service (400+ lines)

  • Unified check_access() method for all authorization
  • Implements: admin bypass, permission checks, owner/team/visibility logic
  • Comprehensive audit logging for all decisions

Database Schema (4 new tables)

  • access_permissions - Configurable permissions
  • access_policies - Policy rules with conditions (Phase 2 ready)
  • access_decisions - Complete audit trail
  • resource_access_rules - Fine-grained permissions (Phase 4 ready)

Complete Migration (250 endpoints)

  • All @require_permission decorators → @require_permission_v2
  • Files: main.py (66), admin.py (95), 11 routers (89)
  • Zero breaking changes - backward compatibility flag included

Architecture

flowchart TD
    A[HTTP Request] -->|Decorator| B[require_permission_v2]
    B -->|Extract user/db| C[PolicyEngine.check_access]
    C -->|Build| D[Subject<br/>email, roles, teams, permissions]
    C -->|Build| E[Resource<br/>type, id, owner, visibility]
    C -->|Build| F[Context<br/>ip, user-agent, timestamp]
    D --> G{Authorization Logic}
    E --> G
    F --> G
    G -->|1. Admin?| H[Allow]
    G -->|2. Has permission?| H
    G -->|3. Is owner?| H
    G -->|4. Team member?| H
    G -->|5. Public resource?| H
    G -->|Default| I[Deny]
    H --> J[Log Decision]
    I --> J
    J --> K[AccessDecisionLog Table]
    J --> L[Return AccessDecision]
    L -->|allowed=true| M[Execute Endpoint]
    L -->|allowed=false| N[HTTP 403]
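
The decision order in the flowchart can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Subject`, `Resource`, and `AccessDecision` dataclasses, their field names (including `team_id`), and the `"admin"` role name are assumptions modelled on the diagram labels. Audit logging is omitted, and the diagram does not say how visibility interacts with the team check, so the sketch ignores that.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Subject:
    # Hypothetical shape based on the diagram: email, roles, teams, permissions.
    email: str
    roles: set[str] = field(default_factory=set)
    teams: set[str] = field(default_factory=set)
    permissions: set[str] = field(default_factory=set)


@dataclass
class Resource:
    # Hypothetical shape based on the diagram: type, id, owner, visibility.
    type: str
    id: str
    owner: Optional[str] = None
    team_id: Optional[str] = None  # assumed field used for the team check
    visibility: str = "private"


@dataclass
class AccessDecision:
    allowed: bool
    reason: str


def check_access(subject: Subject, permission: str, resource: Resource) -> AccessDecision:
    """Evaluate the five allow rules in diagram order, then deny by default."""
    if "admin" in subject.roles:
        return AccessDecision(True, "admin")
    if permission in subject.permissions:
        return AccessDecision(True, "permission")
    if resource.owner is not None and resource.owner == subject.email:
        return AccessDecision(True, "owner")
    if resource.team_id is not None and resource.team_id in subject.teams:
        return AccessDecision(True, "team")
    if resource.visibility == "public":
        return AccessDecision(True, "public")
    return AccessDecision(False, "default-deny")
```

Returning a reason alongside the boolean keeps each decision self-describing, which is what an audit table would need to record.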

Impact & Next Steps

Immediate:

  • Centralized authorization - single point of control
  • Full audit trail of all access decisions
  • Foundation for Phases 2-4

Unblocks Team Members:

  • Phase 2: ABAC support with policy conditions (uses AccessPolicy table)
  • Phase 3: Admin UI for policy management (uses all 4 tables)
  • Phase 4: Advanced features (builds on PolicyEngine interface)

Testing

  • 16 unit tests covering all PolicyEngine paths
  • 89% code coverage on policy_engine.py
  • Backward compatibility via SKIP_POLICY_ENGINE env var
  • All existing tests pass

madhav165 and others added 30 commits January 25, 2026 00:39
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
* fix: FastMCP compatibility

* fix: normalize issuer URL for metadata validation and caching

The original trailing slash fix introduced a bug where the issuer
validation would fail when the server returned an issuer without
trailing slash but the client passed one (or vice versa).

Changes:
- Normalize both the input issuer and metadata issuer for comparison
- Use normalized issuer as cache key for consistent cache lookup
- Add tests for trailing slash normalization scenarios
- Update test to expect refresh_token in grant_types
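
A minimal sketch of the normalization described above, assuming that "normalize" means stripping trailing slashes. The function names and the in-memory cache are hypothetical and are not taken from the gateway code.

```python
from typing import Optional


def normalize_issuer(issuer: str) -> str:
    """Strip trailing slashes so equivalent issuer URLs compare equal."""
    return issuer.rstrip("/")


def issuer_matches(requested: str, advertised: str) -> bool:
    """Compare the client-supplied issuer with the one in the AS metadata."""
    return normalize_issuer(requested) == normalize_issuer(advertised)


# Hypothetical in-memory metadata cache keyed by the normalized issuer.
_metadata_cache: dict[str, dict] = {}


def cache_metadata(issuer: str, metadata: dict) -> None:
    _metadata_cache[normalize_issuer(issuer)] = metadata


def get_cached_metadata(issuer: str) -> Optional[dict]:
    return _metadata_cache.get(normalize_issuer(issuer))
```

Because both the comparison and the cache key go through the same helper, a lookup with or without the trailing slash reaches the same entry.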

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: complete issuer normalization and conditional refresh_token

Address review feedback:

1. Normalize issuer consistently across the entire DCR flow:
   - Allowlist validation uses normalized comparison
   - Storage uses normalized issuer
   - Lookup uses normalized issuer

2. Make refresh_token conditional on AS support:
   - Check grant_types_supported in AS metadata
   - Only request refresh_token if AS advertises support

3. Fix grant_types fallback:
   - Use requested grant_types as fallback when AS response omits them
   - Previously hardcoded to ["authorization_code"] which dropped refresh_token

4. Add comprehensive tests:
   - Test refresh_token inclusion when AS supports it
   - Test grant_types fallback behavior
   - Test allowlist trailing slash normalization

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: handle null grant_types_supported and add issuer normalization migration

Address additional review findings:

1. Fix TypeError when grant_types_supported is explicit null:
   - Use `metadata.get("grant_types_supported") or []` instead of
     `metadata.get("grant_types_supported", [])`
   - The latter returns None when key exists with null value

2. Add configurable permissive refresh_token mode:
   - New setting: dcr_request_refresh_token_when_unsupported
   - Default: False (strict mode - only request if AS advertises support)
   - When True: request refresh_token if AS omits grant_types_supported

3. Add Alembic migration to normalize legacy issuer values:
   - Strips trailing slashes from registered_oauth_clients.issuer
   - Idempotent and works with SQLite and PostgreSQL
   - Prevents duplicate registrations from legacy rows

4. Add comprehensive tests:
   - Test explicit null grant_types_supported handling
   - Test permissive refresh_token mode
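
The null-handling fix and the strict/permissive modes above can be illustrated with a short sketch. The function name and its `request_when_unsupported` parameter are hypothetical stand-ins for the setting named in the commit; treating an explicit null the same as an omitted key is also an assumption.

```python
def grant_types_to_request(metadata: dict, request_when_unsupported: bool = False) -> list[str]:
    """Decide which grant types to request during dynamic client registration."""
    # `or []` covers both a missing key and an explicit JSON null, whereas
    # metadata.get(key, []) would return None when the key exists with null.
    supported = metadata.get("grant_types_supported") or []
    grant_types = ["authorization_code"]
    if "refresh_token" in supported:
        grant_types.append("refresh_token")
    elif not supported and request_when_unsupported:
        # Permissive mode: the AS advertised no grant types at all.
        grant_types.append("refresh_token")
    return grant_types
```

In strict mode (the default) `refresh_token` is requested only when the authorization server advertises it.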

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add DCR_REQUEST_REFRESH_TOKEN_WHEN_UNSUPPORTED to documentation

Update documentation for new DCR refresh token configuration option:

- README.md: Add to DCR settings table
- charts/mcp-stack/values.yaml: Add with comment
- charts/mcp-stack/README.md: Regenerated via helm-docs
- docs/docs/manage/dcr.md: Add env var and behavior note
- docs/docs/config.schema.json: Add schema definition

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: add ARM64 load testing support

- Add build section for fast-time-server to support ARM64 architecture
- Use pre-built ghcr.io image by default for x86_64 performance
- ARM64 users can build locally via environment variable override
- Fix Dockerfile to use TARGETARCH for proper cross-compilation

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Lint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…ched Templates (IBM#2333)

* Optimize SQLite JSON tag filtering with deterministic binds and cached templates

Signed-off-by: Satya <tsp.0713@gmail.com>

* feat: add tag filtering support to list resources template in main apis(non-template)

Signed-off-by: Satya <tsp.0713@gmail.com>

* removed unused fields - page, limit from list resource template from resource services

Signed-off-by: Satya <tsp.0713@gmail.com>

* fix: remove debug print statements from tool_service.py

Remove debugging print statements that were accidentally left in the
tag filtering code path. These were outputting query details to stdout
which is not appropriate for production code.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Lint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: use column-specific bind prefixes to prevent parameter collision

When multiple json_contains_tag_expr calls are combined in the same
query (e.g., filtering on tags from different columns), the fixed
bind names (:p0, :p1) would collide and overwrite parameters.

This fix adds column-specific prefixes to bind parameter names
(e.g., :tools_tags_p0, :resources_tags_p0) to ensure uniqueness
when composing multiple tag filter predicates.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: add coverage for json_contains_tag_expr and resource template filters

Add comprehensive tests for:
- _sanitize_col_prefix helper function
- json_contains_tag_expr for SQLite with match_any and match_all
- Bind parameter collision prevention when combining multiple tag filters
- LRU caching of SQL templates
- New list_resource_templates filtering parameters (tags, visibility,
  include_inactive)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: use thread-safe counter for fully unique bind prefixes

Address edge cases where bind parameters could still collide:
1. Same column filtered multiple times in one query
2. Different column refs that sanitize to identical strings
   (e.g., "a_b.c" and "a.b_c" both become "a_b_c")

Replace static column-based prefix with a thread-safe counter that
generates truly unique prefixes per call (e.g., "tools_tags_42_p0").

This removes the LRU caching of templates since each call now has
a unique prefix, but ensures correctness in all edge cases.

Add test for same-column collision scenario.
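
The collision and the counter-based fix can be sketched as below. The helper names and the exact sanitization rule are assumptions for illustration; only the observable behavior (identical sanitized strings, a unique number per call) follows the commit message.

```python
import itertools
import re
import threading

_counter = itertools.count()
_counter_lock = threading.Lock()


def sanitize_col_prefix(col_ref: str) -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", col_ref)


def unique_bind_prefix(col_ref: str) -> str:
    """Return a prefix that is unique per call, even for identical columns."""
    with _counter_lock:  # serialize access to the shared counter
        n = next(_counter)
    return f"{sanitize_col_prefix(col_ref)}_{n}"
```

Bind names would then be built as `f"{prefix}_p0"`, `f"{prefix}_p1"`, and so on, so two filters on the same column never share a parameter name.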

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Satya <tsp.0713@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* optimize response_cache_by_prompt lookup with inverted index

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>

* fix type hint

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>

* flake8 fixes

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>

* test: add unit tests for response_cache_by_prompt inverted index

Add comprehensive test coverage for the inverted index optimization:
- Tokenization and vectorization functions
- Basic cache store and hit functionality
- Inverted index population and candidate filtering
- Eviction and index rebuild scenarios
- Max entries cap with index consistency
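
As a rough illustration of the inverted-index idea (not the plugin's actual data structures), the sketch below maps each token to the cache entries that contain it, so a lookup only scores entries sharing at least one token with the query. The class name, tokenizer, and integer entry IDs are all assumptions.

```python
import re
from collections import defaultdict


class PromptIndex:
    """Token -> entry-id inverted index for candidate filtering."""

    def __init__(self) -> None:
        self.entries: dict[int, set[str]] = {}
        self.index: dict[str, set[int]] = defaultdict(set)

    @staticmethod
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, entry_id: int, prompt: str) -> None:
        tokens = self.tokenize(prompt)
        self.entries[entry_id] = tokens
        for tok in tokens:
            self.index[tok].add(entry_id)

    def remove(self, entry_id: int) -> None:
        # Keep the index consistent when an entry is evicted.
        for tok in self.entries.pop(entry_id, set()):
            self.index[tok].discard(entry_id)
            if not self.index[tok]:
                del self.index[tok]

    def candidates(self, prompt: str) -> set[int]:
        result: set[int] = set()
        for tok in self.tokenize(prompt):
            result |= self.index.get(tok, set())
        return result
```

Only the returned candidates would need a full similarity comparison, instead of every cached entry.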

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: Add Gateway permission constants

Add GATEWAYS_CREATE, GATEWAYS_READ, GATEWAYS_UPDATE, and GATEWAYS_DELETE
permission constants to the Permissions class for consistency with other
resource types (tools, resources, prompts, servers).

Note: The original PR IBM#2186 attempted to fix issue IBM#2185 by modifying
the visibility query logic, but that change was incorrect. The team
filter should only show resources BELONGING to the filtered team,
not all public resources globally. See todo/rbac.md for documentation.

Issue IBM#2185 needs further investigation - the reported bug may have
a different root cause.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat: Add gateway permission patterns to token scoping middleware

Add gateway routes to token scoping middleware for consistent permission
enforcement:
- Add gateway pattern to _RESOURCE_PATTERNS for ID extraction
- Add gateway CRUD patterns to _PERMISSION_PATTERNS:
  - POST /gateways (exact) -> gateways.create
  - POST /gateways/{id}/... (sub-resources) -> gateways.update
  - PUT/DELETE -> gateways.update/delete
- Add gateway handling in _check_resource_team_ownership:
  - Public: accessible by all
  - Team: accessible by team members
  - Private: owner-only access (per RBAC doc)
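
The method-and-path mapping listed above could look something like the following. The regular expressions and the `required_permission` helper are hypothetical; the middleware's real `_PERMISSION_PATTERNS` table may be shaped differently.

```python
import re
from typing import Optional

# (method, compiled path pattern, permission); the first match wins.
GATEWAY_PERMISSION_PATTERNS = [
    ("POST", re.compile(r"^/gateways/?$"), "gateways.create"),
    ("POST", re.compile(r"^/gateways/[^/]+/.+$"), "gateways.update"),
    ("PUT", re.compile(r"^/gateways/[^/]+"), "gateways.update"),
    ("DELETE", re.compile(r"^/gateways/[^/]+"), "gateways.delete"),
]


def required_permission(method: str, path: str) -> Optional[str]:
    """Return the permission needed for a request, or None if unmapped."""
    for pattern_method, pattern, permission in GATEWAY_PERMISSION_PATTERNS:
        if pattern_method == method.upper() and pattern.match(path):
            return permission
    return None
```

Anchoring the create pattern with `$` is what distinguishes an exact `POST /gateways` from a POST to a sub-resource.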

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Enforce owner-only access for private visibility across all resources

Per RBAC doc, private visibility means "owner only" - not "team members".
Fixed private visibility checks for all resource types to validate
owner_email == requester instead of team membership:
- Servers
- Tools
- Resources
- Prompts
- Gateways (already correct from previous commit)

This aligns token scoping middleware with the documented RBAC model.
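
The visibility rules from these commits reduce to a small decision function. The signature below is a hypothetical simplification; it assumes the requester is already authenticated.

```python
from typing import Optional


def can_access(
    visibility: str,
    owner_email: Optional[str],
    team_id: Optional[str],
    requester_email: str,
    requester_teams: set[str],
) -> bool:
    """Apply the documented visibility model to one resource."""
    if visibility == "public":
        return True  # all authenticated users
    if visibility == "team":
        return team_id is not None and team_id in requester_teams
    if visibility == "private":
        return owner_email == requester_email  # owner only, not team members
    return False  # unknown visibility values are denied
```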

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: Add tests for gateway permissions and visibility RBAC

Add unit tests covering:
- Gateway permission patterns (POST create vs POST update sub-resources)
- Private visibility enforces owner-only access
- Team visibility allows team members only
- Public visibility allows all authenticated users

These tests validate the RBAC fixes in token scoping middleware.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat-2187: add additional default roles while bootstrap

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>

* feat-2187: fix lint issues

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>

* feat-2187: fixing review comments

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>

* feat-2187: fixing review comments

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>

* feat-2187: test fix

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>

* fix: Improve bootstrap roles validation and documentation

Fixes identified by code review:
1. Path resolution: Fixed parent.parent.parent -> parent.parent to correctly
   resolve project root from mcpgateway/bootstrap_db.py
2. JSON validation: Added validation that loaded JSON is a list of dicts with
   required keys (name, scope, permissions). Invalid entries are skipped with
   warnings instead of crashing bootstrap.
3. Improved logging: Log all attempted paths when file not found

Added tests:
- test_bootstrap_roles_with_dict_instead_of_list: Validates error when JSON is
  a dict instead of array
- test_bootstrap_roles_with_missing_required_keys: Validates warning when roles
  are missing required fields

Added documentation:
- docs/docs/manage/rbac.md: New "Bootstrap Custom Roles" section with
  configuration examples for Docker Compose and Kubernetes
- docs/docs/architecture/adr/036-bootstrap-custom-roles.md: ADR documenting
  the feature design, error handling, and security considerations

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Make description and is_system_role optional for bootstrap roles

ChatGPT review identified that description and is_system_role were accessed
unconditionally via role_def["key"], causing KeyError for minimal roles.

Fix:
- Use role_def.get("description", "") with empty string default
- Use role_def.get("is_system_role", False) with False default

Added test:
- test_bootstrap_roles_with_minimal_valid_role: Verifies a role with only
  required fields (name, scope, permissions) is created successfully with
  correct defaults for optional fields
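
A sketch of the validation and defaulting behavior described in the two commits above. The function name and log messages are hypothetical; only the required keys and the defaults for `description` and `is_system_role` come from the commit text.

```python
import json
import logging

logger = logging.getLogger(__name__)
REQUIRED_KEYS = ("name", "scope", "permissions")


def parse_bootstrap_roles(raw: str) -> list[dict]:
    """Parse role definitions, skipping invalid entries instead of raising."""
    data = json.loads(raw)
    if not isinstance(data, list):
        logger.error("Bootstrap roles must be a JSON array, got %s", type(data).__name__)
        return []
    roles = []
    for entry in data:
        if not isinstance(entry, dict) or any(key not in entry for key in REQUIRED_KEYS):
            logger.warning("Skipping invalid role entry: %r", entry)
            continue
        roles.append(
            {
                "name": entry["name"],
                "scope": entry["scope"],
                "permissions": entry["permissions"],
                # Optional fields fall back to defaults rather than KeyError.
                "description": entry.get("description", ""),
                "is_system_role": entry.get("is_system_role", False),
            }
        )
    return roles
```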

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Nithin Katta <Nithin.Katta@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Nithin Katta <Nithin.Katta@ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…y blockers (IBM#2394)

* Remove last 2 security issues from Sonarqube

Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* Remove 5 of 8 blocker maintainability issues

Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

* Correct linting errors

Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>

---------

Signed-off-by: Brian Hussey <brian.hussey@ie.ibm.com>
…ad (IBM#2157)

* perf(crypto): offload Argon2/Fernet to threadpool via asyncio.to_thread

Add async wrappers (hash_password_async, verify_password_async,
encrypt_secret_async, decrypt_secret_async) and update all call sites
to use them, preventing event loop blocking during CPU-intensive
crypto operations.
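
The offloading pattern can be sketched with the standard library alone. PBKDF2 is swapped in for Argon2 here purely so the example has no third-party dependency, and the function signatures are hypothetical rather than the gateway's.

```python
import asyncio
import hashlib
import hmac


def hash_password(password: str, salt: bytes) -> bytes:
    # CPU-bound work. PBKDF2 is a stand-in for Argon2 in this sketch.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


async def hash_password_async(password: str, salt: bytes) -> bytes:
    # Run the blocking hash in the default thread pool so the event loop
    # stays free to serve other requests.
    return await asyncio.to_thread(hash_password, password, salt)


async def verify_password_async(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = await hash_password_async(password, salt)
    return hmac.compare_digest(candidate, expected)
```

Call sites must `await` these wrappers; calling them without `await` yields a coroutine object rather than a result.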

Closes IBM#1836

Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com>

* fix(tests): update tests for async crypto operations

Update test mocks to use async versions of password service and
encryption service methods (hash_password_async, verify_password_async,
encrypt_secret_async) following the changes in the crypto offload PR.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sso): add missing await for async create/update provider methods

The crypto offload PR made SSOService.create_provider() and
update_provider() async, but forgot to update call sites:

- mcpgateway/routers/sso.py: add await in admin endpoints
- mcpgateway/utils/sso_bootstrap.py: convert to async, add awaits
- mcpgateway/main.py: make attempt_to_bootstrap_sso_providers async

Without this fix, the router endpoints would return coroutine objects
instead of provider objects, causing runtime errors (500) when
accessing provider.id. The bootstrap would silently skip provider
creation with "coroutine was never awaited" warnings.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test(crypto): add tests for async crypto wrappers and SSO bootstrap

Add test coverage for the async crypto operations introduced by the
crypto offload PR:

- test_async_crypto_wrappers.py: Tests for hash_password_async,
  verify_password_async, encrypt_secret_async, decrypt_secret_async
  including roundtrip verification and sync/async compatibility

- test_sso_bootstrap.py: Tests for async SSO bootstrap ensuring
  create_provider and update_provider are properly awaited

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: ESnark <31977180+ESnark@users.noreply.github.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* chore-2193: add Rocky Linux setup script

Add setup script for Rocky Linux and RHEL-compatible distributions.
Adapts the Ubuntu setup script with the following changes:

- Use dnf package manager instead of apt
- Docker CE installation via RHEL repository
- OS detection for Rocky, RHEL, CentOS, and AlmaLinux
- Support for x86_64 and aarch64 architectures

Closes IBM#2193

Signed-off-by: Jonathan Springer <jps@s390x.com>

* chore-2193: add Docker login check before compose-up

Check if Docker is logged in before running docker-compose to avoid
image pull failures. If not logged in, prompt user with options:
- Interactive login (username/password prompts)
- Username with password from stdin (for automation)
- Skip login (continue without authentication)

Supports custom registry URLs for non-Docker Hub registries.

Signed-off-by: Jonathan Springer <jps@s390x.com>

* fix: add non-interactive mode and git repo check to setup scripts

Apply to both Rocky and Ubuntu setup scripts:
- Add -y/--yes flag for fully non-interactive operation
- Check for .git directory before running git pull
- Fail fast with clear error if directory exists but isn't a git repo
- Auto-confirm prompts in non-interactive mode
- Exit with error on unsupported OS in non-interactive mode

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect

Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned
tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker.

Changes:
- Add _respond_tasks dict to track respond tasks by session_id
- Cancel respond tasks explicitly before session cleanup in remove_session()
- Cancel all respond tasks during shutdown()
- Pass disconnect callback to SSE transport for defensive cleanup
- Convert database backend from fire-and-forget to structured concurrency

The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect,
and awaited to completion, preventing orphaned tasks from spinning the event loop.
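
A simplified sketch of tracking tasks by session ID and cancelling them on removal. The `SessionTasks` class is hypothetical and leaves out the timeout escalation that later commits add.

```python
import asyncio


class SessionTasks:
    """Track one respond task per session so none is left orphaned."""

    def __init__(self) -> None:
        self._respond_tasks: dict[str, asyncio.Task] = {}

    def register(self, session_id: str, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._respond_tasks[session_id] = task
        return task

    async def remove_session(self, session_id: str) -> None:
        task = self._respond_tasks.pop(session_id, None)
        if task is None:
            return
        task.cancel()
        # Await the task so cancellation actually completes; the resulting
        # CancelledError is collected instead of propagating.
        await asyncio.gather(task, return_exceptions=True)

    async def shutdown(self) -> None:
        for session_id in list(self._respond_tasks):
            await self.remove_session(session_id)
```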

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: additional fixes for CPU spin loop after SSE disconnect

Follow-up fixes based on testing and review:

1. Cancellation timeout escalation (Finding 1):
   - _cancel_respond_task() now escalates on timeout by calling transport.disconnect()
   - Retries cancellation after escalation
   - Always removes task from tracking to prevent buildup

2. Redis respond loop exit path (Finding 2):
   - Changed from infinite pubsub.listen() to timeout-based get_message() polling
   - Added session existence check - loop exits if session removed
   - Allows loop to exit even without cancellation

3. Generator finally block cleanup (Finding 3):
   - Added on_disconnect_callback() in event_generator() finally block
   - Covers: CancelledError, GeneratorExit, exceptions, and normal completion
   - Idempotent - safe if callback already ran from on_client_close

4. Added load-test-spin-detector make target:
   - Spike/drop pattern to stress test session cleanup
   - Docker stats monitoring at each phase
   - Color-coded output with pass/fail indicators
   - Log file output to /tmp

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: fix race condition in sse_endpoint and add stuck task tracking

Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task
was created AFTER create_sse_response(). If client disconnected during
response setup, the disconnect callback ran before the task existed,
leaving it orphaned. Now matches utility_sse_endpoint ordering:
1. Compute user_with_token
2. Create and register respond task
3. Call create_sse_response()

Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't
be cancelled after escalation. Previously these were dropped from tracking
entirely, losing visibility. Now they're moved to _stuck_tasks for
monitoring and final cleanup during shutdown().

Updated tests to verify escalation behavior.

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add SSE failure cleanup, stuck task reaper, and full load test

Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response()
fails. Added try/except around create_sse_response() in both sse_endpoint
and utility_sse_endpoint - on failure, calls remove_session() to clean up
the task and session before re-raising.

Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to:
- Remove completed tasks from _stuck_tasks
- Retry cancellation for still-stuck tasks
- Prevent memory leaks from tasks that eventually complete

Finding 3 (LOW): Added test for escalation path with fake transport to
verify transport.disconnect() is called during escalation. Also added
tests for the stuck task reaper lifecycle.

Also updated load-test-spin-detector to be a full-featured test matching
load-test-ui with JWT auth, all user classes, entity ID fetching, and
the same 4000-user baseline.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: improve load-test-spin-detector output and reduce cycle sizes

- Reduce logging level to WARNING to suppress noisy worker messages
- Only run entity fetching and cleanup on master/standalone nodes
- Reduce cycle sizes from 4000 to 1000 peak users for faster iteration
- Update banner to reflect new cycle pattern (500 -> 750 -> 1000)
- Remove verbose JWT token generation log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: address remaining CPU spin loop findings

Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE
endpoints. In Python 3.8+, CancelledError inherits from BaseException,
not Exception, so the previous except block wouldn't catch it. Now
cleanup runs even when requests are cancelled during SSE handshake.

Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None
to prevent tight loop. The loop now has guaranteed minimum sleep even
when Redis returns immediately in certain states.

Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops
to exit early. remove_session() now marks the session as closing BEFORE
attempting task cancellation, so the respond loop (Redis and DB backends)
can exit immediately without waiting for the full cancellation timeout.

Finding 4 (LOW): Already addressed in previous commit with test
test_cancel_respond_task_escalation_calls_transport_disconnect.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: make load-test-spin-detector run unlimited cycles

- Cycles now repeat indefinitely instead of stopping after 5
- Fixed log file path to /tmp/spin_detector.log for easy monitoring
- Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts
- Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)")
- Banner shows monitoring command: tail -f /tmp/spin_detector.log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: remove redundant asyncio.CancelledError handlers

CancelledError inherits from BaseException in Python 3.8+, so it won't
be caught by 'except Exception' handlers. The explicit handlers were
unnecessary and triggered pylint W0706 (try-except-raise).
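
The exception hierarchy point is easy to confirm in isolation. The handler below is a toy stand-in, not the SSE endpoint code.

```python
import asyncio


async def handler(log: list) -> None:
    try:
        await asyncio.sleep(3600)
    except Exception:
        # Not reached on cancellation: CancelledError is a BaseException.
        log.append("except Exception")
    finally:
        log.append("finally")


async def run_and_cancel() -> list:
    log: list = []
    task = asyncio.create_task(handler(log))
    await asyncio.sleep(0)  # let the handler start and suspend
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return log
```

Cleanup that must run on cancellation therefore belongs in `finally`, not in `except Exception`.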

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add sleep on non-message Redis pubsub types to prevent spin

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(pubsub): replace blocking listen() with timeout-based get_message()

The blocking `async for message in pubsub.listen()` pattern doesn't
respond to asyncio cancellation properly. When anyio's cancel scope
tries to cancel tasks using this pattern, the tasks don't respond
because the async iterator is blocked waiting for Redis messages.

This causes anyio's `_deliver_cancellation` to continuously reschedule
itself with `call_soon()`, creating a CPU spin loop that consumes
100% CPU per affected worker.

Changed to timeout-based polling pattern:
- Use `get_message(timeout=1.0)` with `asyncio.wait_for()`
- Loop allows cancellation check every ~1 second
- Added sleep on None/non-message responses to prevent edge case spins
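
A sketch of the polling pattern, using a fake pub/sub object so it runs without Redis. `FakePubSub` only mimics the `get_message(timeout=...)` call shape, and the `poll` helper and its stop event are assumptions rather than the service code.

```python
import asyncio


class FakePubSub:
    """Test double that returns queued messages, then None after a timeout."""

    def __init__(self, messages) -> None:
        self._messages = list(messages)

    async def get_message(self, timeout: float = 0.0):
        if self._messages:
            return self._messages.pop(0)
        await asyncio.sleep(timeout)  # nothing arrived within the timeout
        return None


async def poll(pubsub, handle, stop: asyncio.Event, timeout: float = 1.0) -> None:
    while not stop.is_set():
        try:
            message = await asyncio.wait_for(
                pubsub.get_message(timeout=timeout), timeout=timeout + 0.5
            )
        except asyncio.TimeoutError:
            continue
        if message is None or message.get("type") != "message":
            # Short sleep so an immediately returning client cannot spin.
            await asyncio.sleep(0.1)
            continue
        handle(message["data"])
```

Each iteration returns control to the event loop within a bounded time, so cancellation or a stop flag is noticed promptly.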

Files fixed:
- mcpgateway/services/cancellation_service.py
- mcpgateway/services/event_service.py

Closes IBM#2360 (partial - additional spin sources may exist)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops

The MCP session/transport __aexit__ methods can block indefinitely when
internal tasks don't respond to cancellation. This causes anyio's
_deliver_cancellation to spin in a tight loop, consuming ~800% CPU.

Root cause: When calling session.__aexit__() or transport.__aexit__(),
they attempt to cancel internal tasks (like post_writer waiting on
memory streams). If these tasks don't respond to CancelledError, anyio's
cancel scope keeps calling call_soon() to reschedule _deliver_cancellation,
creating a CPU spin loop.

Changes:
- Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py
- Wrap all __aexit__ calls in asyncio.wait_for() with timeout
- Add timeout to pubsub cleanup in session_registry.py and registry_cache.py
- Add timeout to streamable HTTP context cleanup in translate.py
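
Bounding an `__aexit__` call with a timeout might look like this. The `close_with_timeout` helper is hypothetical; the 5-second constant mirrors the value named in the commit.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
SESSION_CLEANUP_TIMEOUT = 5.0  # seconds, as stated in the commit message


async def close_with_timeout(ctx, timeout: float = SESSION_CLEANUP_TIMEOUT) -> bool:
    """Exit an async context manager, giving up after `timeout` seconds."""
    try:
        await asyncio.wait_for(ctx.__aexit__(None, None, None), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("cleanup timed out after %.1fs", timeout)
        return False
```

Returning a flag lets the caller decide whether to log, retry, or discard the session when cleanup does not finish in time.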

This is a continuation of the fix for issue IBM#2360.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(config): make session cleanup timeout configurable

Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to
control how long cleanup operations wait for session/transport __aexit__
calls to complete.

Clarification: This timeout does NOT affect tool execution time (which
uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions
to prevent CPU spin loops when internal tasks don't respond to cancel.

Changes:
- Add mcp_session_pool_cleanup_timeout to config.py
- Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs
- Add to charts/mcp-stack/values.yaml
- Update mcp_session_pool.py to use _get_cleanup_timeout() helper
- Update session_registry.py and registry_cache.py to use config
- Update translate.py to use config with fallback

When to adjust:
- Increase if you see frequent "cleanup timed out" warnings in logs
- Decrease for faster shutdown (at risk of resource leaks)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add deadline to cancel scope to prevent CPU spin loop

Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at
100% CPU when SSE task group tasks don't respond to cancellation.

Root cause: When an SSE connection ends, sse_starlette's task group
tries to cancel all tasks. If a task (like _listen_for_disconnect
waiting on receive()) doesn't respond to cancellation, anyio's
_deliver_cancellation keeps rescheduling itself in a tight loop.

Fix: Override EventSourceResponse.__call__ to set a deadline on the
cancel scope when cancellation starts. This ensures that if tasks
don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds),
the scope times out instead of spinning indefinitely.

References:
- agronholm/anyio#695
- anthropics/claude-agent-sdk-python#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(translate): use patched EventSourceResponse to prevent CPU spin

translate.py was importing EventSourceResponse directly from sse_starlette,
bypassing the patched version in sse_transport.py that prevents the anyio
_deliver_cancellation CPU spin loop (anyio#695).

This change ensures all SSE connections in the translate module (stdio-to-SSE
bridge) also benefit from the cancel scope deadline fix.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(cleanup): reduce cleanup timeouts from 5s to 0.5s

With many concurrent connections (691 TCP sockets observed), each cancelled
SSE task group spinning for up to 5 seconds caused sustained high CPU usage.
Reducing the timeout to 0.5s minimizes CPU waste during spin loops while
still allowing normal cleanup to complete.

The cleanup timeout only affects cleanup of cancelled/released connections,
not normal operation or tool execution time.

Changes:
- SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds
- mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds
- Updated .env.example and charts/mcp-stack/values.yaml

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(cleanup): make SSE cleanup timeout configurable with safe defaults

- Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s)
- Make sse_transport.py read timeout from config via lazy loader
- Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default
- Override both to 0.5s in docker-compose.yml for testing

The 5.0s default is safe for production. The 0.5s override in
docker-compose.yml allows testing aggressive cleanup to verify
it doesn't affect normal operation.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(gunicorn): reduce max_requests to recycle stuck workers

The MCP SDK's internal anyio task groups don't respond to cancellation
properly, causing CPU spin loops in _deliver_cancellation. This spin
happens inside the MCP SDK (streamablehttp_client, sse_client) which
we cannot patch.

Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are
recycled frequently, cleaning up any accumulated stuck task groups.

Root cause chain observed:
1. PostgreSQL idle transaction timeout
2. Gateway state change failures
3. SSE connections terminated
4. MCP SDK task groups spin (anyio#695)

This is a workaround until the MCP SDK properly handles cancellation.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(anyio): monkey-patch _deliver_cancellation to prevent CPU spin

Root cause: anyio's _deliver_cancellation has no iteration limit.
When tasks don't respond to CancelledError, it schedules call_soon()
callbacks indefinitely, causing 100% CPU spin (anyio#695).

Solution:
- Monkey-patch CancelScope._deliver_cancellation to track iterations
- Give up after 100 iterations and log warning
- Clear _cancel_handle to stop further call_soon() callbacks

Also switched from asyncio.wait_for() to anyio.move_on_after() for
MCP session cleanup, which better propagates cancellation through
anyio's cancel scope system.

Trade-off: If cancellation gives up after 100 iterations, some tasks
may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000
worker recycling will eventually clean up orphaned tasks.
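
The iteration cap described above can be illustrated with a toy model that does not import or patch anyio. Everything here is hypothetical: `cap_redelivery` and `stubborn` are illustration names, and `deliver(scope)` stands in for a callback that returns True when it wants to be scheduled again.

```python
from typing import Callable, Dict


def cap_redelivery(
    deliver: Callable[[object], bool], max_iterations: int = 100
) -> Callable[[object], bool]:
    """Wrap `deliver` so it runs at most `max_iterations` times per scope."""
    attempts: Dict[int, int] = {}

    def wrapper(scope: object) -> bool:
        key = id(scope)
        attempts[key] = attempts.get(key, 0) + 1
        if attempts[key] > max_iterations:
            # Give up: forget the scope and report "do not reschedule".
            del attempts[key]
            return False
        wants_retry = deliver(scope)
        if not wants_retry:
            # Delivery finished normally, so the counter is no longer needed.
            attempts.pop(key, None)
        return wants_retry

    return wrapper


def stubborn(scope: object) -> bool:
    """A stand-in for a task that never honours cancellation."""
    return True


capped = cap_redelivery(stubborn, max_iterations=100)
scope = object()
rounds = 0
while capped(scope):  # without the cap this loop would never end
    rounds += 1
```

The cap trades completeness for bounded CPU use, which is the trade-off the commit message describes: a task that ignores cancellation stays alive after the wrapper gives up.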

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(anyio): make _deliver_cancellation patch optional and disabled by default

The anyio monkey-patch is now feature-flagged and disabled by default:
- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default)
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100

This allows testing performance with and without the patch, and easy
rollback if upstream anyio/MCP SDK fixes the issue.

Added:
- Config settings for enabling/disabling the patch
- apply_anyio_cancel_delivery_patch() function for explicit control
- remove_anyio_cancel_delivery_patch() to restore original behavior
- Documentation in .env.example and docker-compose.yml

To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add comprehensive CPU spin loop mitigation documentation (IBM#2360)

Add multi-layered documentation for CPU spin loop mitigation settings
across all configuration files. This ensures operators understand and
can tune the workarounds for anyio#695.

Changes:
- .env.example: Add Layer 1/2/3 headers with cross-references to docs
  and issue IBM#2360, document all 6 mitigation variables
- README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers,
  configuration tables, and tuning tips
- docker-compose.yml: Consolidate all mitigation variables into one
  section with SSE protection (Layer 1), cleanup timeouts (Layer 2),
  and experimental anyio patch (Layer 3)
- charts/mcp-stack/values.yaml: Add comprehensive mitigation section
  with layer documentation and cross-references
- docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide
  with root cause analysis, 4-layer defense diagram, configuration
  tables, diagnostic commands, and tuning recommendations
- docs/docs/.pages: Add Operations section to navigation
- docs/docs/operations/.pages: Add nav for operations docs

Mitigation variables documented:
- Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX
- Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT
- Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS

Related: IBM#2360, anyio#695, claude-agent-sdk#378
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(loadtest): aggressive spin detector with configurable timings

Update spin detector load test for faster issue reproduction:
- Increase user counts: 4000 → 4000 → 10000 pattern
- Fast spawn rate: 1000 users/s
- Shorter wait times: 0.01-0.1s between requests
- Reduced connection timeouts: 5s (fail fast)

Related: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* compose mitigation

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* load test

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to cancel_on_finish for interrogate coverage

Add docstring to nested cancel_on_finish function in
EventSourceResponse.__call__ to achieve 100% interrogate coverage.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
IBM#2507)

Updates unique constraints for Resources and Prompts tables to support
Gateway-level namespacing. Previously, these entities enforced uniqueness
globally per Team/Owner (team_id, owner_email, uri/name). This prevented
users from registering the same Gateway multiple times with different names.

Changes:
- Add gateway_id to unique constraints for resources and prompts
- Add partial unique indexes for local items (where gateway_id IS NULL)
- Make migration idempotent with proper existence checks
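
A minimal sketch of why the partial unique index is needed, using the standard-library sqlite3 module rather than the project's Alembic migration. The table layout, index name, and sample values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE resources (
        id INTEGER PRIMARY KEY,
        team_id TEXT,
        owner_email TEXT,
        gateway_id TEXT,
        uri TEXT NOT NULL,
        -- Gateway-scoped uniqueness: same uri is allowed under different gateways.
        UNIQUE (team_id, owner_email, gateway_id, uri)
    );
    -- NULL gateway_id values never collide in the constraint above,
    -- so local items need their own partial unique index.
    CREATE UNIQUE INDEX uq_local_resource
        ON resources (team_id, owner_email, uri)
        WHERE gateway_id IS NULL;
    """
)


def add(gateway_id, uri="file:///a"):
    """Insert one row; return False if a uniqueness rule rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO resources (team_id, owner_email, gateway_id, uri) "
                "VALUES (?, ?, ?, ?)",
                ("t1", "a@example.com", gateway_id, uri),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [add("gw1"), add("gw2"), add("gw1"), add(None), add(None)]
```

In this sketch the same URI is accepted once per gateway and once as a local item; repeats within either scope are rejected.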

Closes IBM#2352

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2517)

* fix(transport): support mixed content types from MCP server tool call response

Closes IBM#2512

This fix addresses tool invocation failures for tools that return complex
content types (like ResourceLink, ImageContent, AudioContent) or contain
Pydantic-specific types like AnyUrl.

Root causes fixed:
1. tool_service.py: Usage of model_dump() without mode='json' preserved
   pydantic.AnyUrl objects, violating internal model's str type constraints.
2. streamablehttp_transport.py: Code blindly assumed types.TextContent,
   accessing .text on every item, which crashed for ResourceLink or ImageContent.

Changes:
- Updated tool_service.py to use model_dump(by_alias=True, mode='json'),
  forcing conversion of AnyUrl to JSON-compatible strings.
- Refactored streamablehttp_transport.py to inspect content.type and correctly
  map to proper MCP SDK types (TextContent, ImageContent, AudioContent,
  ResourceLink, EmbeddedResource) ensuring full protocol compatibility.
- Updated return type annotation to include all MCP content types.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(transport): preserve metadata in mixed content type conversion

Addresses dropped metadata fields identified in PR IBM#2517 review:
- Preserve annotations and _meta for TextContent, ImageContent, AudioContent
- Preserve size and _meta for ResourceLink (critical for file metadata)
- Handle EmbeddedResource via model_validate

Add comprehensive regression tests for:
- Mixed content types (text, image, audio, resource_link, embedded)
- Metadata preservation (annotations, _meta, size)
- Unknown content type fallback
- Missing optional metadata handling

Closes IBM#2512

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(transport): convert gateway Annotations to dict for MCP SDK compatibility

mcpgateway.common.models.Annotations is a different Pydantic class from
mcp.types.Annotations. Passing gateway Annotations directly to MCP SDK
types causes ValidationError at runtime when real MCP responses include
annotations.

Fix:
- Add _convert_annotations() helper to convert gateway Annotations to dict
- Add _convert_meta() helper for consistent meta handling
- Apply conversion to all content types (text, image, audio, resource_link)

Add regression tests using actual gateway model types:
- test_call_tool_with_gateway_model_annotations
- test_call_tool_with_gateway_model_image_annotations

These tests use mcpgateway.common.models.TextContent/ImageContent with
mcpgateway.common.models.Annotations to verify the conversion works.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test(tool_service): add AnyUrl serialization tests for mode='json' fix

Add explicit tests for the AnyUrl serialization fix (Issue IBM#2512 root cause):
- test_anyurl_serialization_without_mode_json - demonstrates the problem
- test_anyurl_serialization_with_mode_json - verifies the fix
- test_resource_link_anyurl_serialization - ResourceLink uri field
- test_tool_result_with_resource_link_serialization - ToolResult with ResourceLink
- test_mixed_content_with_anyurl_serialization - mixed content types

These tests verify that mode='json' in model_dump() correctly serializes
AnyUrl objects to strings, preventing validation errors when content is
passed to MCP SDK types.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs(transport): add docstrings to _convert_annotations and _convert_meta

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs(transport): add Args/Returns to helper function docstrings

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Add user information (email, full_name, is_admin) to the plugin global
context, enabling plugins like Cedar RBAC to make access control decisions
based on user attributes beyond just email.

Changes:
- Add _inject_userinfo_instate() function to auth.py that populates
  global_context.user as a dictionary when include_user_info is enabled
- Update GlobalContext.user type to Union[str, dict] for backward compat
- Add include_user_info config option to plugin_settings (default: false)
- Prevent tool_service from overwriting user dict with string email

The feature is disabled by default to maintain backward compatibility
with existing plugins that expect global_context.user to be a string.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
…BM#2529)

* Add profiling tools, memray

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Add profiling tools, memray

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(db): release DB sessions before external HTTP calls to prevent pool exhaustion

This commit addresses issue IBM#2518 where DB connection pool exhaustion occurred
during A2A and RPC tool calls due to sessions being held during slow upstream
HTTP requests.

Changes:
- tool_service.py: Extract A2A agent data to local variables before calling
  db.commit(), allowing HTTP calls to proceed without holding the DB session.
  The A2A tool invocation logic now uses pre-extracted data instead of querying
  during the HTTP call phase.

- rbac.py: Add db.commit() and db.close() calls before returning user context
  in all authentication paths (proxy, anonymous, disabled auth). This ensures
  DB sessions are released early and not held during subsequent request processing.

- test_rbac.py: Update test to provide mock db parameter and verify that
  db.commit() and db.close() are called for proper session cleanup.

The fix follows the pattern established in other services: extract all needed
data from ORM objects, call db.commit() to release the transaction, then
proceed with external HTTP calls. This prevents "idle in transaction" states
that exhaust PgBouncer's connection pool under high load.
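
The extract, commit, then call pattern can be sketched with stand-in objects. `FakeSession`, `AgentRow`, `invoke_agent`, and `fake_post` are hypothetical names used only to show the ordering; they are not the gateway's real classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentRow:
    """Stand-in for an ORM row."""
    name: str
    endpoint_url: str


class FakeSession:
    """Records the order of operations instead of talking to a database."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def get_agent(self, agent_id: str) -> AgentRow:
        self.log.append("query")
        return AgentRow(name=agent_id, endpoint_url="https://agent.example/invoke")

    def commit(self) -> None:
        self.log.append("commit")


def invoke_agent(db: FakeSession, agent_id: str,
                 http_post: Callable[[str, Dict[str, str]], str]) -> str:
    row = db.get_agent(agent_id)
    # Copy plain values out of the row while the transaction is still open.
    endpoint, name = row.endpoint_url, row.name
    # End the transaction before the slow upstream call starts.
    db.commit()
    return http_post(endpoint, {"agent": name})


events: List[str] = []


def fake_post(url: str, payload: Dict[str, str]) -> str:
    events.append("http")
    return f"called {url} for {payload['agent']}"


reply = invoke_agent(FakeSession(events), "summarizer", fake_post)
```

Because only copied values are used after the commit, the HTTP call never touches the session, so the connection can return to the pool while the upstream is still responding.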

Load test results (4000 concurrent users, 1M+ requests):
- Success rate: 99.81%
- 502 errors reduced to 0.02% (edge cases with very slow upstreams)
- P50: 450ms, P95: 4300ms

Closes IBM#2518

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* perf(config): tune connection pools for high concurrency

Based on profiling with 4000 concurrent users (~2000 RPS):

- MCP_SESSION_POOL_MAX_PER_KEY: 50 → 200 (reduce session creation)
- IDLE_TRANSACTION_TIMEOUT: 120s → 300s (handle slow MCP calls)
- CLIENT_IDLE_TIMEOUT: 120s → 300s (align with transaction timeout)
- HTTPX_MAX_CONNECTIONS: 200 → 500 (more outbound capacity)
- HTTPX_MAX_KEEPALIVE_CONNECTIONS: 100 → 300
- REDIS_MAX_CONNECTIONS: 150 → 100 (stay under maxclients)

Results:
- Failure rate: 0.446% → 0.102% (4.4x improvement)
- RPC latency: 3,014ms → 1,740ms (42% faster)
- CRUD latency: 1,207ms → 508ms (58% faster)

See: todo/profile-full.md for detailed analysis
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix(helm): stabilize chart templates and configs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(helm): align migration job with bootstrap

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs(helm): refresh chart README

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* docs: sync env defaults and references

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: sync env templates and performance tuning

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* chore: stabilize coverage target

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: reduce test warnings

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: reduce test startup costs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: resolve bandit warning

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* test(playwright): handle admin password change

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test(playwright): stabilize admin UI flows

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2534)

The MCP specification does not mandate that tool names must start with
a letter - tool names are simply strings without pattern restrictions.
This fix updates the validation pattern to align with SEP-986.

Changes:
- Update VALIDATION_TOOL_NAME_PATTERN from ^[a-zA-Z][a-zA-Z0-9._-]*$
  to ^[a-zA-Z0-9_][a-zA-Z0-9._/-]*$ per SEP-986
- Allow leading underscore/number and slashes in tool names
- Remove / from HTML special characters regex (not XSS-relevant)
- Update all error messages, docstrings, and documentation
- Update tests to verify new valid cases

Tool names like `_5gpt_query_by_market_id` and `namespace/tool` are
now accepted.
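
The two patterns quoted above can be compared directly with Python's `re` module. The constant and function names below are illustrative, not the project's.

```python
import re

# Patterns copied from the change description above.
OLD_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9._-]*$")
NEW_PATTERN = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9._/-]*$")


def accepted(name: str) -> tuple:
    """Return (accepted_by_old, accepted_by_new) for a tool name."""
    return (bool(OLD_PATTERN.match(name)), bool(NEW_PATTERN.match(name)))


samples = ["get_time", "_5gpt_query_by_market_id", "namespace/tool", "/leading"]
report = {name: accepted(name) for name in samples}
```

A leading slash is still rejected by the new pattern because `/` is allowed only after the first character.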

Closes IBM#2528

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…figuration (IBM#2515)

- Add passphrase-protected key support for Granian via --ssl-keyfile-password
- Add KEY_FILE_PASSWORD and CERT_PASSPHRASE compatibility in run-granian.sh
- Export KEY_FILE in run-gunicorn.sh for Python SSL manager access
- Improve Makefile cert targets with proper permissions (640) and group 0
- Split certs-passphrase into two-step generation (genrsa + req) for AES-256
- Add SSL configuration templates to nginx.conf for client and backend TLS
- Expose port 443 in NGINX Dockerfile for HTTPS support
- Update docker-compose.yml with TLS-related comments and correct cert paths
- Add comprehensive TLS configuration documentation

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2537)

During gateway activation with OAuth Authorization Code flow,
`_initialize_gateway` returns empty lists because the user hasn't
completed authorization yet. Health checks then treat these empty
responses as legitimate and delete all existing tools/resources/prompts.

This change adds an `oauth_auto_fetch_tool_flag` parameter to
`_initialize_gateway` that:

- When False (default): Returns empty lists for auth_code gateways
  during health checks, preserving existing tools
- When True (activation): Skips the early return for auth_code
  gateways, allowing activation to proceed

The existing check in `_refresh_gateway_tools_resources_prompts` at
lines 4724-4729 prevents stale deletion for auth_code gateways with
empty responses.

Fixed issues from original PR:
- Corrected typo: oath -> oauth in parameter name
- Removed duplicate docstring entry
- Fixed logic bug that incorrectly skipped token fetch for
  client_credentials flow when flag was True


Closes IBM#2272

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* feat(auth): add token revocation and proxy auth to admin middleware

- Support token revocation checks in AdminAuthMiddleware
- Enable proxy authentication for admin routes
- Filter session listings by user ownership
- Validate team membership for OAuth operations
- Add configurable public registration setting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(config): change token validation defaults to secure-by-default

- Set require_token_expiration default to true (was false)
- Set require_jti default to true (was false)
- Update .env.example to reflect new secure defaults

Tokens without expiration or JTI claims will now be rejected by default.
Set REQUIRE_TOKEN_EXPIRATION=false or REQUIRE_JTI=false to restore
previous behavior if needed for backward compatibility.
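
A rough sketch of what the two defaults imply for an already-decoded claims dictionary. `validate_claims` is a hypothetical helper; it only checks that the claims are present and says nothing about signature or expiry verification, which the real code handles elsewhere.

```python
from typing import Any, Dict, List


def validate_claims(
    claims: Dict[str, Any],
    require_token_expiration: bool = True,
    require_jti: bool = True,
) -> List[str]:
    """Return a list of problems; an empty list means the claims pass."""
    problems: List[str] = []
    if require_token_expiration and "exp" not in claims:
        problems.append("missing exp claim")
    if require_jti and "jti" not in claims:
        problems.append("missing jti claim")
    return problems


legacy_token = {"sub": "admin@example.com"}
strict = validate_claims(legacy_token)
relaxed = validate_claims(
    legacy_token, require_token_expiration=False, require_jti=False
)
```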

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs(security): expand securing guide with token lifecycle and access controls

Add documentation for:
- Token lifecycle management (revocation, validation settings)
- Admin route authentication requirements
- Session management access controls
- User registration configuration
- Updated production checklist with new settings

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(auth): address SSO redirect validation and admin middleware gaps

- SSO redirect_uri validation now uses server-side allowlist only
  (allowed_origins, app_domain) instead of trusting Host header
- Full origin comparison including scheme and port to prevent
  cross-port or HTTP downgrade redirects
- AdminAuthMiddleware now supports API token authentication
- AdminAuthMiddleware now honors platform admin bootstrap when
  REQUIRE_USER_IN_DB=false for fresh deployments
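
The full-origin comparison for redirect URIs might look like the following sketch, which uses only `urllib.parse`. The helper names and the example allowlist are assumptions; the project's actual validation may differ in detail.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Optional[Tuple[str, str, int]]:
    """Return (scheme, host, port) or None if the URL is not a usable origin."""
    try:
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()
        port = parts.port or DEFAULT_PORTS.get(scheme)
    except ValueError:  # malformed port or netloc
        return None
    if scheme not in DEFAULT_PORTS or not host or port is None:
        return None
    return (scheme, host, port)


def is_allowed_redirect(redirect_uri: str, allowed_origins: Iterable[str]) -> bool:
    """Compare scheme, host, and port against a server-side allowlist only."""
    target = origin(redirect_uri)
    if target is None:
        return False
    return any(origin(entry) == target for entry in allowed_origins)


ALLOWED = ["https://gateway.example.com"]
checks = {
    "same origin": is_allowed_redirect("https://gateway.example.com/cb", ALLOWED),
    "http downgrade": is_allowed_redirect("http://gateway.example.com/cb", ALLOWED),
    "other port": is_allowed_redirect("https://gateway.example.com:8443/cb", ALLOWED),
    "other host": is_allowed_redirect("https://evil.example.net/cb", ALLOWED),
}
```

Nothing in this check reads the request's Host header, which is the point of the change: the allowlist is the only source of trusted origins.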

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(auth): add basic auth support to AdminAuthMiddleware

Align AdminAuthMiddleware with require_admin_auth by supporting:
- HTTP Basic authentication for legacy deployments
- Basic auth users are treated as admin (consistent with existing behavior)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(auth): finalize secure defaults and update changelog for RC1

- Move hashlib/base64 imports to top-level in main.py (pylint C0415)
- Add CHANGELOG entry for 1.0.0-RC1 secure defaults release
- Add Security Defaults section to .env.example
- Update test helpers to include JTI by default for REQUIRE_JTI=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(auth): streamline authentication model and update documentation

- Simplify Admin UI to use session-based email/password authentication
- Add API_ALLOW_BASIC_AUTH setting for granular API auth control
- Scope gateway credentials to prevent unintended forwarding
- Update 25+ documentation files for auth model clarity
- Add comprehensive test coverage for auth settings
- Fix REQUIRE_TOKEN_EXPIRATION and REQUIRE_JTI defaults in docs
- Remove BASIC_AUTH_* from Docker examples (not needed by default)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: update changelog with neutral language and ignore coverage.svg

- Reword RC1 changelog entries to use neutral language
- Add coverage.svg to .gitignore (generated by make coverage)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
prakhar-singh1928 and others added 5 commits February 18, 2026 08:06
…ing (IBM#2958)

The config validator now converts empty/whitespace-only X_FRAME_OPTIONS
values to None, matching the middleware's expectation that None means
"allow all embedding". Previously, an empty string fell through to the
default DENY behavior in the middleware, blocking embedding unexpectedly.
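
The normalization described above amounts to something like this sketch. `normalize_x_frame_options` is a hypothetical name, not the validator in the project's config module.

```python
from typing import Optional


def normalize_x_frame_options(value: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only values to None; keep real values trimmed."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


examples = [normalize_x_frame_options(v) for v in ("", "   ", "DENY", " SAMEORIGIN ", None)]
```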

Closes IBM#2492

Signed-off-by: prakhar-singh1928 <prakhar.singh1928@ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…der (IBM#3026)

* fix(auth): add jwks_uri column to SSOProvider and harden create_provider

Add jwks_uri as a first-class column on SSOProvider for standard OIDC
JWKS endpoint support. Make create_provider defensive by filtering
unknown keys to prevent TypeError crashes during SSO bootstrap.

Closes IBM#3010

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(auth): include jwks_uri in bootstrap_db schema check and document setting

Update _schema_looks_current() to check for sso_providers.jwks_uri,
preventing unversioned databases from being stamped at head without
the new column. Add SSO_GENERIC_JWKS_URI to .env.example.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* lint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* update roadmap and changelog

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* update links in docs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member

Thanks for the thorough rework, @yiannis2804 — the original 3 issues (decorator newlines, allow_admin_bypass, unrelated commits) are all fixed, and jonpspri's feedback has been well-addressed. The PolicyEngine Subject/Resource/Context/Decision design is clean.

However, the new decorator has a production bug that needs fixing before merge:

1. require_permission_v2 returns 500 on endpoints using _db parameter naming (blocking)

The decorator extracts the DB session as:

db = kwargs.get("db") or (user.get("db") if isinstance(user, dict) else getattr(user, "db", None))

This misses endpoints using _db: Session = Depends(get_db) — at least 8 are affected: admin_events, admin_add_root, admin_delete_root, admin_get_import_status, admin_list_import_statuses, check_catalog_server_status, get_observability_partial, get_observability_metrics_partial.

In production, get_current_user_with_permissions returns "db": None (comment says "Session closed; use endpoint's db param instead"), so the fallback user.get("db") is also None, and the decorator raises HTTPException(500, "Database session not available").

Tests don't catch this because rbac_mocks.py was changed from "db": None to "db": MagicMock(), masking the production code path.

Fix:

db = kwargs.get("db") or kwargs.get("_db") or (user.get("db") if isinstance(user, dict) else getattr(user, "db", None))
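
As a sketch, the same lookup can be pulled into a small helper so the accepted parameter names are listed in one place. `extract_db` is a hypothetical name, and it uses `is not None` where the one-liner above uses `or`, so a falsy but valid session object would still be returned.

```python
from typing import Any, Dict, Optional


def extract_db(kwargs: Dict[str, Any], user: Any) -> Optional[Any]:
    """Find a DB session under `db` or `_db`, then fall back to the user object."""
    for name in ("db", "_db"):
        session = kwargs.get(name)
        if session is not None:
            return session
    if isinstance(user, dict):
        return user.get("db")
    return getattr(user, "db", None)


class StubSession:
    """Placeholder for a SQLAlchemy session."""


session = StubSession()
# Mirrors production, where the user context carries "db": None.
prod_user = {"email": "admin@example.com", "db": None}
found_db = extract_db({"db": session}, prod_user)
found_underscore = extract_db({"_db": session}, prod_user)
found_nothing = extract_db({}, prod_user)
```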

2. Alembic migration not idempotent

Per project convention (CLAUDE.md), migrations must check before modifying. The migration has 4 op.create_table() calls with no inspector guard. Fresh databases create tables via create_all() from db.py, then Alembic runs and crashes on duplicate CREATE TABLE.

Fix — add at the top of upgrade():

inspector = sa.inspect(op.get_bind())
if "access_permissions" in inspector.get_table_names():
    return
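
For illustration, the same guard can be applied per table instead of returning early on the first one. This sketch uses the standard-library sqlite3 module as a stand-in for Alembic's `op` and the SQLAlchemy inspector, and the column definitions are placeholders; only the four table names come from the PR description.

```python
import sqlite3

# Placeholder DDL: the real tables have many more columns.
TABLE_DDL = {
    "access_permissions": "CREATE TABLE access_permissions (id INTEGER PRIMARY KEY)",
    "access_policies": "CREATE TABLE access_policies (id INTEGER PRIMARY KEY)",
    "access_decisions": "CREATE TABLE access_decisions (id INTEGER PRIMARY KEY)",
    "resource_access_rules": "CREATE TABLE resource_access_rules (id INTEGER PRIMARY KEY)",
}


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def upgrade(conn: sqlite3.Connection) -> list:
    """Create only the tables that are missing and report which ones were created."""
    created = []
    for name, ddl in TABLE_DDL.items():
        if table_exists(conn, name):
            continue
        conn.execute(ddl)
        created.append(name)
    return created


conn = sqlite3.connect(":memory:")
first_run = upgrade(conn)
second_run = upgrade(conn)  # nothing left to create
```

Checking each table separately also covers a partially created schema, which a single early return keyed on `access_permissions` would skip.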

3. sa.text("now()") fails on SQLite

The migration uses server_default=sa.text("now()") in 4 places. SQLite (the project default) doesn't support now(). The ORM models correctly use server_default=func.now() — the migration should match. Use sa.text("CURRENT_TIMESTAMP") or sa.func.now().
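
The difference is easy to reproduce with the standard-library sqlite3 module. This sketch assumes the migration's text default is rendered into the DDL as `DEFAULT now()`; the table and helper names are made up for the demonstration.

```python
import sqlite3


def ddl_is_accepted(default_clause: str) -> bool:
    """Try to create and use a table whose timestamp column has the given default."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE demo ("
            "id INTEGER PRIMARY KEY, "
            f"created_at DATETIME DEFAULT {default_clause})"
        )
        conn.execute("INSERT INTO demo (id) VALUES (1)")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


now_ok = ddl_is_accepted("now()")  # SQLite has no now() function
current_timestamp_ok = ddl_is_accepted("CURRENT_TIMESTAMP")  # standard SQL keyword
```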

4. Performance note (non-blocking)

The new approach loads ALL permissions eagerly via fresh_db_session() + PermissionService in get_current_user_with_permissions on every authenticated request. The old decorator loaded permissions lazily only on protected endpoints. Consider caching at a higher level (TTL cache keyed by user email) in a follow-up.
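
One possible shape for the suggested follow-up is a small in-process TTL cache keyed by user email. This is a sketch under assumptions: `TTLCache` and `load_permissions` are hypothetical, and a real deployment would also need invalidation when roles change and a strategy for multiple workers.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache loader results per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Demonstration with a controllable clock and a loader that counts its calls.
fake_now = [0.0]
load_calls = []


def load_permissions(email: str) -> list:
    load_calls.append(email)
    return ["tools.read"]


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get_or_load("a@example.com", load_permissions)
cache.get_or_load("a@example.com", load_permissions)  # served from cache
fake_now[0] = 61.0
cache.get_or_load("a@example.com", load_permissions)  # expired, reloaded
```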

Phase 1 of IBM#2019 - Core Policy Engine implementation

Added:
- PolicyEngine service with check_access() method
- Database models: AccessPermission, AccessPolicy, AccessDecisionLog, ResourceAccessRule
- Migration to create policy engine tables
- 5 unit tests covering: admin bypass, permission checks, owner access

Features:
- Admin bypass (admins have all permissions)
- Direct permission checking
- Resource owner access
- Team member access to team resources
- Public resource access
- Deny by default

Status: Foundation complete, ready for middleware migration

Related: IBM#2019 Phase 1
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Proof of concept - migrated first endpoint to use new authorization system

Changes:
- Added require_permission_v2 decorator using PolicyEngine
- Migrated GET /servers endpoint from old @require_permission to new decorator
- Import added to main.py

This demonstrates the migration pattern for the remaining 249 endpoints.

Related: IBM#2019 Phase 1
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…patibility

- Replace @require_permission with @require_permission_v2 across all endpoints
- Add wildcard permission matching (e.g., admin.* matches admin.system_config)
- Add allow_admin_bypass parameter to require_permission_v2 decorator
- Handle both dict and Pydantic model user objects in decorator
- Add SKIP_POLICY_ENGINE env var for backward-compatible test runs
- Add permissions field to test fixtures (rbac_mocks, test_main, test_admin)
- Merge alembic heads (04cda6733305 + policy_engine_phase1)
- Zero test regressions: 11135 passed (same as upstream/main)

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Address code review feedback from @jonpspri:

Problem: Migration used postgresql.JSONB which breaks SQLite compatibility.
Project defaults to SQLite (DATABASE_URL=sqlite:///./mcp.db).

Solution:
- Replaced all 5 instances of postgresql.JSONB with sa.JSON
- Removed PostgreSQL-specific astext_type parameter
- Removed unused postgresql import
- sa.JSON works with both SQLite and PostgreSQL

Result:
- Migration now compatible with SQLite (default database)
- Maintains compatibility with PostgreSQL
- Cross-database compatibility restored

Related: PR IBM#2682 Phase 1 Code Review Item IBM#2
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…BM#3)

Address code review feedback from @jonpspri:

Problem: When allow_admin_bypass=False, admins still bypassed permission
checks because PolicyEngine.check_access() had its own unconditional
admin bypass at Step 1.

Solution:
- Added allow_admin_bypass parameter to check_access() method
- Updated Step 1 admin bypass: if subject.is_admin AND allow_admin_bypass
- Removed workaround of setting subject.is_admin = False in decorator
- Properly pass allow_admin_bypass from decorator to check_access()

Result:
- When allow_admin_bypass=False, admins must have explicit permissions
- No security regression - behavior matches old decorator
- Cleaner implementation without subject mutation

Testing:
- Verified admin bypass works when allow_admin_bypass=True
- Verified admin bypass blocked when allow_admin_bypass=False
- All 262 admin tests still passing
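
A stripped-down sketch of the corrected Step 1 logic follows. The `Subject` fields and the exact-match permission check are simplifications chosen for illustration; the real `check_access()` takes more arguments and returns a decision object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subject:
    email: str
    is_admin: bool = False
    permissions: List[str] = field(default_factory=list)


def check_access(subject: Subject, permission: str, allow_admin_bypass: bool = True) -> bool:
    # Step 1: the admin bypass applies only when the caller allows it.
    if subject.is_admin and allow_admin_bypass:
        return True
    # Step 2: otherwise the subject needs the permission explicitly.
    return permission in subject.permissions


admin = Subject(email="admin@example.com", is_admin=True)
with_bypass = check_access(admin, "tokens.revoke")
without_bypass = check_access(admin, "tokens.revoke", allow_admin_bypass=False)
```

Passing the flag into the check avoids mutating `subject.is_admin` in the decorator, which is the workaround this commit removes.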

Related: PR IBM#2682 Phase 1 Code Review Item IBM#3
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Address code review feedback from @jonpspri:

Problem: PR description claimed 'Full audit trail of all access decisions'
but _log_decision() only logged to Python logger with TODO comment.

Solution:
- Updated docstring to clarify DB audit logging is Phase 2
- Changed logging level from INFO to DEBUG (reduce production noise)
- AccessDecisionLog table is created and ready for Phase 2
- Honest about current state vs future implementation

Result:
- Clear documentation that DB audit is Phase 2 scaffolding
- Application logging still captures all decisions
- Table structure ready for Phase 2 implementation
- No misleading claims in PR description

Related: PR IBM#2682 Phase 1 Code Review Item IBM#5
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
)

Address code review suggestion from @jonpspri:

Problem: Subject, Resource, Context, and AccessDecision used manual
__init__ methods with mutable default arguments and no type validation.

Solution:
- Converted all 4 data models to @dataclass
- Used field(default_factory=list/dict) for mutable defaults
- Added __post_init__ to Context for default timestamp
- Renamed Resource parameters: resource_type→type, resource_id→id

Benefits:
- Typed fields via dataclass annotations (annotations are not enforced at runtime)
- Per-instance defaults via default_factory (no shared mutable default argument bugs)
- Cleaner, more Pythonic code
- Better serialization support
- Protection against common pitfalls

All 262 tests passing with dataclass models.
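
The two dataclass techniques named in this commit can be shown in a short sketch. The field names below are assumptions and cover only a fraction of the real models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Subject:
    email: str
    # default_factory builds a new list for every instance.
    permissions: List[str] = field(default_factory=list)


@dataclass
class Context:
    ip_address: Optional[str] = None
    timestamp: Optional[datetime] = None

    def __post_init__(self) -> None:
        """Fill in the timestamp when the caller did not supply one."""
        if self.timestamp is None:
            self.timestamp = datetime.now(timezone.utc)


alice = Subject(email="alice@example.com")
bob = Subject(email="bob@example.com")
alice.permissions.append("tools.read")  # must not leak into bob
ctx = Context(ip_address="203.0.113.7")
```

Note that dataclasses do not check field types at runtime, so `Subject(email=123)` would still be accepted in this sketch.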

Related: PR IBM#2682 Phase 1 Code Review Item IBM#6
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…BM#7)

Address code review suggestion from @jonpspri:

Problem: The _has_permission static method handles *, admin.*, and exact
matches but had no dedicated tests covering edge cases.

Solution:
- Added 5 comprehensive wildcard permission tests
- Tests cover: exact match, *, namespace.*, multiple permissions
- Discovered and fixed bug: admin.* incorrectly matched 'admin'

Bug Fixed:
- admin.* should only match admin.SOMETHING (e.g., admin.system_config)
- It should NOT match just 'admin' (no dot)
- Fixed by removing 'required == prefix' check
- Now correctly requires dot: required.startswith(prefix + '.')

Test Coverage:
✅ test_exact_match - exact permission matching
✅ test_wildcard_star_matches_all - * matches everything
✅ test_namespace_wildcard - admin.* matches admin.anything
✅ test_wildcard_does_not_match_namespace_only - admin.* ≠ admin
✅ test_multiple_permissions - combined permission lists

All 21 PolicyEngine tests passing (16 existing + 5 new).
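
The matching rules described in this commit can be sketched as a standalone function. `has_permission` is an illustrative reimplementation based on the text above, not a copy of the `_has_permission` static method.

```python
from typing import Iterable


def has_permission(granted: Iterable[str], required: str) -> bool:
    """Return True if any granted permission covers the required one."""
    for perm in granted:
        if perm == "*" or perm == required:
            return True
        if perm.endswith(".*"):
            prefix = perm[:-2]
            # The dot is mandatory: "admin.*" covers "admin.x" but not "admin".
            if required.startswith(prefix + "."):
                return True
    return False


cases = {
    "exact": has_permission(["tools.read"], "tools.read"),
    "star": has_permission(["*"], "admin.dashboard"),
    "namespace": has_permission(["admin.*"], "admin.system_config"),
    "namespace_only": has_permission(["admin.*"], "admin"),
    "unrelated": has_permission(["tools.read"], "tools.write"),
}
```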

Related: PR IBM#2682 Phase 1 Code Review Item IBM#7
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…BM#8)

Address code review suggestion from @jonpspri:

Problem: The _check_resource_access logic (owner, team, visibility) is
well-thought-out but never executed because no callsite passes resource_type
to the decorator. Could be forgotten.

Solution:
- Added comprehensive NOTE explaining this is Phase 2+ scaffolding
- Documents why it's currently not called (no resource_type parameter)
- Provides Phase 2 activation plan with 4 clear steps
- Includes example future usage
- Prevents implementation from being forgotten

Current State:
- Resource always None in check_access()
- _check_resource_access never executes
- Permission checks are permission-level only

Future Phase 2:
- Decorators will pass resource_type
- Extract resource_id from function params
- Fine-grained per-resource access control
- Check ownership, team membership, visibility

Related: PR IBM#2682 Phase 1 Code Review Item IBM#8
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Address code review suggestions from @jonpspri (minor items):

1. ✅ Lazy logging format:
   - Changed logger.debug(f'msg: {var}') to logger.debug('msg: %s', var)
   - Avoids f-string evaluation when debug logging is disabled
   - Applied to _log_decision and check_access methods

2. ✅ Documented per-request instantiation:
   - Added NOTE about PolicyEngine(db) created on every request
   - Acceptable for Phase 1 (stateless, simple)
   - Can be optimized in Phase 2+ with caching/pooling

3. ✅ Improved test DB mocking:
   - Changed db_session fixture from next(get_db()) to MagicMock
   - More portable unit tests without real DB dependency
   - Cleaner test isolation

Additional fixes:
- Updated all test Resource() calls to use type= and id= parameters
- Updated all test AccessDecision() calls to use resource_type= and resource_id=
- Maintains consistency after dataclass conversion in Issue IBM#6

All 21 PolicyEngine tests passing.

Related: PR IBM#2682 Phase 1 Code Review Item IBM#9
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
- Removed unused os import
- Added docstring to __post_init__ method
- Added allow_admin_bypass to check_access docstring
- Applied black formatting

All linting checks passing:
- flake8: ✅
- pylint: 10/10 ✅
- ruff: ✅
- black: ✅
- isort: ✅
- make verify: 10/10 Mascarpone ✅

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…o v2

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…permissions

The PolicyEngine decorator expects the user object to have a 'permissions' field,
but get_current_user_with_permissions() was not populating it. This caused all
permission checks to fail with 'Permission denied' errors because users had an
empty permissions list, even if their roles (like platform_admin with wildcard '*')
had permissions assigned.

This fix:
- Loads user permissions from their roles using PermissionService.get_user_permissions()
- Adds the 'permissions' field to the user dictionary
- Ensures wildcard permissions from roles like platform_admin are properly loaded
- The wildcard '*' permission correctly matches any required permission (e.g., 'admin.dashboard')
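
The behaviour described above can be sketched as follows. This is an illustration only: `collect_permissions` and `has_permission` are hypothetical helpers, the role dictionaries are an assumed shape, and none of this is the actual PermissionService or PolicyEngine code.

```python
def collect_permissions(roles):
    """Merge the permissions of all roles into one sorted list (assumed shape)."""
    merged = set()
    for role in roles:
        merged.update(role.get("permissions", []))
    return sorted(merged)


def has_permission(granted, required):
    """Return True if `required` is granted directly or via the '*' wildcard."""
    return "*" in granted or required in granted


# Hypothetical user dictionary with the 'permissions' field populated from roles.
roles = [{"name": "platform_admin", "permissions": ["*"]}]
user = {"email": "admin@example.com", "permissions": collect_permissions(roles)}
```

A user whose `permissions` list is empty fails every check, which matches the 'Permission denied' symptom described in this commit.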

Fixes Playwright smoke test failures where admin users were denied access.

Related: IBM#2682
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
- Add 'permissions': ['*'] and 'db': MagicMock() to all mock user fixtures
- PolicyEngine.check_access() requires both permissions list and db session
- Fix test_main.py, test_main_extended.py, test_cancellation_router.py
- Fix test_admin_catalog_htmx.py, test_admin_import_export.py
- Fix test_tokens.py, test_teams.py, test_teams_v2.py, test_rbac_router.py
- Fix test_admin_error_handlers.py, test_multi_auth_headers.py
- Fix test_resource_service_plugins.py, test_metrics_rollup_service.py
- Add autouse fixtures for settings-gated services (metrics, log aggregator)
- Fix all pre-commit flake8 and trailing whitespace issues
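
A minimal sketch of the mock user shape described in the first two bullets. The `make_mock_user` helper and the `email` field are assumptions made for illustration; in the test suite this would typically be wrapped in a pytest fixture.

```python
from unittest.mock import MagicMock


def make_mock_user(email="admin@example.com"):
    """Build a mock user dict carrying the two fields the commit adds."""
    return {
        "email": email,           # assumed field, for illustration only
        "permissions": ["*"],     # wildcard so permission checks pass in tests
        "db": MagicMock(),        # stand-in for a database session
    }
```

Each call returns a fresh `MagicMock`, so calls recorded on one test's session do not leak into another.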

Fixes: 944 -> 0 test failures (100% pass rate)
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
…mport

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
1. Fix _db parameter extraction in decorator
   - Add kwargs.get('_db') to handle endpoints using _db parameter naming
   - Fixes 8 affected endpoints: admin_events, admin_add_root, etc.

2. Make alembic migration idempotent
   - Add inspector check before creating tables
   - Prevents duplicate CREATE TABLE errors on fresh databases

3. Replace sa.text('now()') with sa.func.now()
   - Fixes SQLite compatibility (SQLite doesn't support now() function)
   - Changed 4 instances to match ORM model patterns
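
Item 1 above can be sketched as a small lookup that accepts either parameter name. The `extract_db` helper is hypothetical; the real decorator may structure this differently.

```python
def extract_db(kwargs):
    """Return the session passed as `db` or `_db`, or None if neither is present."""
    db = kwargs.get("db")
    if db is None:
        # Some endpoints name the dependency `_db` instead of `db`.
        db = kwargs.get("_db")
    return db
```

An explicit `is None` check is used rather than `or`, so a session object that happens to be falsy is not skipped.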

Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
Signed-off-by: yiannis2804 <yiannis2804@gmail.com>
@yiannis2804
Contributor Author

@crivetimihai Thanks for the thorough review! I've addressed all three blocking issues:

  1. Fixed _db parameter extraction ✅

Added kwargs.get("_db") to the decorator's db extraction logic
Now handles endpoints using _db: Session = Depends(get_db) parameter naming
Fixes the 8 affected endpoints (admin_events, admin_add_root, etc.)

  2. Made alembic migration idempotent ✅

Added inspector check at the top of upgrade() to detect existing tables
Returns early if access_permissions table already exists
Prevents duplicate CREATE TABLE errors on fresh databases

  3. Fixed SQLite compatibility ✅

Replaced all 4 instances of sa.text("now()") with sa.func.now()
Now matches the ORM model patterns and works on SQLite

All unit tests pass (0 failures), migration runs successfully, and is idempotent on re-run.
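
The idempotency and SQLite points can be illustrated with the standard-library sqlite3 module. This is a stand-in, not the Alembic migration itself: the `sqlite_master` lookup plays the role of the inspector check, the column list is invented for the example, and `CURRENT_TIMESTAMP` is what `sa.func.now()` is expected to render as on SQLite.

```python
import sqlite3


def table_exists(conn, name):
    """Check SQLite's catalog for a table (stand-in for the inspector check)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def upgrade(conn):
    """Create the table once; return False if it already exists."""
    if table_exists(conn, "access_permissions"):
        return False  # early return makes a re-run a no-op
    conn.execute(
        "CREATE TABLE access_permissions ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "                          # hypothetical column
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"    # portable default
    )
    return True


def sqlite_has_now(conn):
    """Report whether SQLite accepts a now() function call."""
    try:
        conn.execute("SELECT now()")
    except sqlite3.OperationalError:
        return False
    return True
```

Running `upgrade` twice against the same connection creates the table on the first call and does nothing on the second, and `sqlite_has_now` returns False on a stock SQLite build.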

@crivetimihai
Member

Reopened as #3189. CI/CD will re-run on the new PR. You are still credited as the author.

This pull request was closed.

Labels

enhancement: New feature or request
security: Improves security
SHOULD: P2: Important but not vital; high-value items that are not crucial for the immediate release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Centralized configurable RBAC/ABAC policy engine