Skip to content

feat(security): implement policy audit trail and decision logging#2707

Closed
kcostell06 wants to merge 2208 commits intoIBM:mainfrom
kcostell06:feature/audit-logging-2225
Closed

feat(security): implement policy audit trail and decision logging#2707
kcostell06 wants to merge 2208 commits intoIBM:mainfrom
kcostell06:feature/audit-logging-2225

Conversation

@kcostell06
Copy link
Copy Markdown

📋 Issue

Closes #2225

🎯 Summary

Implements comprehensive policy audit trail and decision logging system for IBM MCP Context Forge, meeting all requirements from issue #2225.

📦 Changes

New Files

  • mcp-servers/audit/ - Core audit logging implementation (5 modules)
  • tests/audit/ - Comprehensive test suite (45+ tests)
  • docs/audit/ - Complete documentation

Features Implemented

  • ✅ Audit record logging with complete context
  • ✅ Database storage (SQLite, PostgreSQL-ready)
  • ✅ SIEM integrations (Splunk HEC, Elasticsearch, Webhook)
  • ✅ REST API for querying (FastAPI)
  • ✅ Query filtering (subject, resource, decision, time range)
  • ✅ Statistics and analytics
  • ✅ Export to CSV/JSON
  • ✅ Compliance framework support
  • ✅ Retention management

✅ User Stories Completed

US-1: Security Analyst - Query Access Decisions

# Query decisions by subject email
GET /audit/decisions?subject_email=user@example.com&decision=deny

US-2: Security Team - Export to SIEM

  • Splunk HEC format with batch processing
  • Elasticsearch bulk API integration
  • Generic webhook support

🧪 Testing

All tests passing ✅

Run basic tests:

python3 tests/audit/test_mcp_audit.py
# Result: 16/16 tests passed

Run comprehensive tests:

pytest tests/audit/test_mcp_audit_comprehensive.py -v
# Result: 45+ tests passed

📚 Documentation

  • Implementation Guide: docs/audit/MCP_AUDIT_IMPLEMENTATION.md
  • Testing Guide: docs/audit/TESTING_GUIDE.md
  • Quick Start: docs/audit/QUICKSTART_TESTING.md
  • Test Summary: docs/audit/TEST_SUMMARY.md

🔗 Related Issues

ChrisPC-39 and others added 30 commits January 21, 2026 00:40
Signed-off-by: Chris PC <chrispc@li-4dc2bf4c-325d-11b2-a85c-b68e8b1fc307.ibm.com>
Co-authored-by: Chris PC <chrispc@li-4dc2bf4c-325d-11b2-a85c-b68e8b1fc307.ibm.com>
* fix tags in mcp servers

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>

* fix(tags): handle dict-format tags in filtering and read paths

Address issues with dict-format tag storage introduced by migration:

- Add json_contains_tag_expr() to handle both List[str] and List[Dict]
  tag formats in database queries (SQLite, PostgreSQL, MySQL)
- Update all services (gateway, tool, server, prompt, resource, a2a)
  to use json_contains_tag_expr for tag filtering
- Fix catalog_service.py to pass through dict-format tags without
  re-validation
- Restore legacy tag handling in _prepare_gateway_for_read for
  backward compatibility
- Update tests to mock json_contains_tag_expr instead of json_contains_expr

Closes IBM#2203

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…IBM#2253)

This commit fixes a bug where deactivating a virtual server associated with an email notification team would fail with a database error.

The root cause was the use of `joinedload` to fetch the `email_team` relationship when retrieving the server to be updated. This resulted in a `LEFT OUTER JOIN` in the underlying SQL query. When the query also included a
`FOR UPDATE` clause to lock the server row, PostgreSQL raised a `FeatureNotSupported` error because it cannot apply a lock to the nullable side of an outer join.

This fix changes the SQLAlchemy loading strategy from `joinedload` to `selectinload` for the `DbServer.email_team` relationship within the `set_server_state` method. `selectinload` resolves the issue by loading the relate
`email_team` in a separate `SELECT` statement, thus avoiding the problematic `JOIN` in the initial `SELECT ... FOR UPDATE` query.

Additionally, comprehensive unit tests have been added for both the `ServerService` and the admin panel routes to cover server activation, deactivation, and to verify that the fix works as expected, preventing future
regressions.

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
* tag view

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>

* fix: Handle object tags in token details and improve fallback handling

- Add object-to-string conversion for tags in showTokenDetailsModal
  (was missed in original PR)
- Remove inconsistent blank line in viewGateway tag rendering
- Add JSON.stringify fallback for malformed tag objects without id/label
  (defense-in-depth for edge cases like DB corruption)

Part of the fix for IBM#2267

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>

---------

Signed-off-by: rakdutta <rakhibiswas@yahoo.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crmihai1@ie.ibm.com>
Co-authored-by: Mihai Criveti <crmihai1@ie.ibm.com>
* Fix bump2version

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* bump2version

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* 1.0.0-BETA-2

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* feat: optimize CPU usage in request logging middleware

Add configuration options to reduce CPU and database overhead in
detailed request logging:

- log_detailed_skip_endpoints: List of path prefixes to skip from
  detailed logging (e.g., high-volume or low-value endpoints)
- log_resolve_user_identity: Gate DB fallback for user identity
  resolution behind opt-in flag (default: false)
- log_detailed_sample_rate: Sampling rate (0.0-1.0) to log only a
  fraction of requests when detailed logging is enabled

These optimizations avoid expensive JSON parsing, masking, and identity
lookups unless detailed logging is explicitly enabled and required.

Closes IBM#1865

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add documentation for logging CPU optimization options

Document the new logging configuration options:
- LOG_DETAILED_SKIP_ENDPOINTS: path prefixes to skip from logging
- LOG_DETAILED_SAMPLE_RATE: sampling rate for detailed logging
- LOG_RESOLVE_USER_IDENTITY: opt-in DB lookup for user identity

Updated:
- .env.example with new options and descriptions
- README.md logging table and examples
- Helm chart values.yaml and values.schema.json
- charts/mcp-stack/README.md values table
- docs/config.schema.json (regenerated from Pydantic model)
- docs/docs/config.schema.json (regenerated from Pydantic model)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: add LOG_DETAILED_SKIP_ENDPOINTS to env list normalizer and add tests

- Add LOG_DETAILED_SKIP_ENDPOINTS to _normalize_env_list_vars() to
  support CSV format and empty string values from environment variables
- Add unit tests for skip endpoints, sampling rate, and user identity
  resolution gating in request logging middleware
- Add settings field validation tests for new config options

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* doctest coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: lower doctest coverage threshold to 34%

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…patibility (IBM#2342)

- Use jsonschema.validators.validator_for to detect schema draft automatically
- Support multiple JSON Schema drafts (Draft 4, 6, 7, 2019-09, 2020-12)
- Log warnings for unsupported drafts or invalid schemas instead of raising errors
- Handle None schemas gracefully
- Apply consistent validation behavior to both tool and prompt schemas
- Add comprehensive tests for different schema drafts
- Add fallback validator logic in tool_service.py for runtime validation
- Disable MCP SDK's built-in input validation which uses strict Draft 2020-12

Closes IBM#2322

Signed-off-by: Keval Mahajan <mahajankeval23@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…#2351)

* fix(db): Guard against inactive transaction during async cleanup

When registering MCP servers with long initialization times (like Moody's),
a CancelledError can occur during the MCP session teardown (DELETE request).
This causes the database transaction to become inactive before get_db()
attempts to commit, resulting in:

  sqlalchemy.exc.InvalidRequestError: This transaction is inactive

Add db.is_active checks before commit() and rollback() to handle cases
where the transaction becomes inactive during async context manager cleanup.

Closes IBM#2341

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Add upsert logic for resources and prompts to prevent unique constraint violations

When re-registering a gateway (e.g., after deletion or crash), orphaned resources
and prompts from previous registrations could cause unique constraint violations
on `(team_id, owner_email, uri)` for resources and `(team_id, owner_email, name)`
for prompts.

This fix adds upsert logic that:
1. Queries for existing resources/prompts matching the unique constraint
2. Updates existing records instead of creating duplicates
3. Creates new records only when no match exists

This handles scenarios like:
- Gateway deletion that didn't properly clean up resources (issue IBM#2341)
- Re-registration of the same MCP server under a new gateway name
- Race conditions during concurrent operations

Closes IBM#2352

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add cleanup script for orphaned resources/prompts/tools

Adds a utility script to identify and remove database records that were
left orphaned due to incomplete gateway deletions (e.g., IBM#2341 crash).

Usage:
  # Dry run (default) - shows what would be deleted
  python scripts/cleanup_orphaned_resources.py

  # Actually delete orphaned records
  python scripts/cleanup_orphaned_resources.py --execute

  # Filter by team or owner
  python scripts/cleanup_orphaned_resources.py --team-id <id> --owner-email <email>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Only upsert orphaned resources/prompts, add tests

Addresses code review findings:

1. HIGH: Only update truly orphaned records (gateway_id IS NULL or points
   to non-existent gateway). Resources belonging to active gateways are
   no longer at risk of being reassigned.

2. MEDIUM: Use per-resource team/owner overrides when building lookup key,
   matching exactly what would be inserted to avoid constraint mismatches.

3. LOW: Added tests for orphaned resource upsert logic:
   - test_register_gateway_updates_orphaned_resources
   - test_register_gateway_does_not_update_active_gateway_resources

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Always call rollback() in exception handler, improve tests

Addresses remaining code review findings:

1. ROLLBACK GUARD FIX:
   - REMOVED the `if db.is_active:` check before `db.rollback()`
   - Empirical testing proved that:
     * After IntegrityError, is_active becomes False
     * rollback() when is_active=False SUCCEEDS (doesn't fail!)
     * rollback() restores is_active to True, cleaning up the session
     * Skipping rollback when is_active=False leaves session unusable
   - The is_active guard for commit() is CORRECT (commit fails when False)
   - The is_active guard for rollback() was WRONG (rollback is always safe)

2. TEST ASSERTIONS:
   - Rewrote orphaned resource tests with proper assertions
   - Tests now directly verify:
     * Orphaned resources are detected and added to map
     * Resource fields are actually updated during upsert
     * Resources with deleted gateways are detected as orphaned
     * Resources with active gateways are NOT touched
     * Per-resource owner/team overrides are used in lookup key

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add file encoding header to cleanup script

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
)

Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com>
Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
Signed-off-by: RinCodeForge927 <dangnhatrin90@gmail.com>
Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ps (IBM#2359)

* fix(perf): resolve lock contention and CPU spin loop under high load (IBM#2355)

Phase 1: Replace cascading FOR UPDATE loops with bulk UPDATE statements
in gateway_service.set_gateway_state() to eliminate lock contention when
activating/deactivating gateways with many tools/servers/prompts/resources.

Phase 2: Add nowait=True to get_for_update() calls in set_server_state()
and set_tool_state() to fail fast on locked rows instead of blocking.
Add ServerLockConflictError and ToolLockConflictError exceptions with
409 Conflict handlers in main.py and admin.py routers.

Phase 3: Fix CPU spin loop in SSE transport by properly detecting client
disconnection. Add request.is_disconnected() check, consecutive error
counting, GeneratorExit handling, and ensure _client_gone is set in all
exit paths.

Results:
- RPS improved from 173-600 to ~2000 under load
- Failure rate reduced from 14-22% to 0.03-0.04%
- Blocked queries reduced from 33-48 to 0
- CPU after load test: ~1% (was 800%+ spin loop)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(perf): add database lock timeout configuration (IBM#2355)

Add configurable timeout settings for database operations:
- db_lock_timeout_ms: Maximum wait for row locks (default 5000ms)
- db_statement_timeout_ms: Maximum statement execution (default 30000ms)

These settings can be used with get_for_update() to prevent indefinite
blocking under high concurrency scenarios.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* test: update tests for bulk UPDATE and SSE transport changes (IBM#2355)

Update gateway_service tests to use side_effect for multiple db.execute
calls (SELECT + bulk UPDATEs) instead of single return_value.

Update row_level_locking test to expect nowait=True parameter in
get_for_update calls for set_tool_state.

Update SSE transport tests to mock request.is_disconnected() and adjust
error handling test to expect consecutive errors causing generator stop
instead of error event emission.

Add missing exception documentation for ServerLockConflictError and
ToolLockConflictError in service docstrings (flake8 DAR401).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add send_timeout to EventSourceResponse to prevent spin loops (IBM#2355)

When Granian ASGI server fails to send to a disconnected client, it logs
"ASGI transport error: SendError" but doesn't raise an exception to our
code. This causes rapid iteration of the generator without proper
timeout handling.

Add send_timeout=5.0 to EventSourceResponse to ensure sends time out if
they fail, triggering sse_starlette's built-in error handling.

Also enable sse_starlette's built-in ping mechanism when keepalive is
enabled, which provides additional disconnect detection.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add rapid yield detection to prevent CPU spin loops (IBM#2355)

When clients disconnect abruptly, Granian may fail sends without
raising Python exceptions. This adds rapid yield detection: if
50+ yields occur within 1 second, we assume client is disconnected
and stop the generator.

New configurable settings:
- SSE_SEND_TIMEOUT: ASGI send timeout (default 30s)
- SSE_RAPID_YIELD_WINDOW_MS: detection window (default 1000ms)
- SSE_RAPID_YIELD_MAX: max yields before disconnect (default 50)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): log rapid yield detection at ERROR level for visibility

Changed from WARNING to ERROR so the detection message is visible
even when LOG_LEVEL=ERROR. This is appropriate since rapid yield
detection indicates a problem condition (client disconnect not
reported by ASGI).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review findings from ChatGPT analysis

Finding 1: Fix lock conflict error propagation
- Add explicit except handlers for ToolLockConflictError and
  ServerLockConflictError before the generic Exception handler
- This allows 409 responses to propagate correctly instead of
  being wrapped as generic 400 errors

Finding 3: Improve SSE rapid yield detection
- Only track message yields, not keepalives
- Reset the timestamp deque when timeout occurs (we actually waited)
- This prevents false positives on high-throughput legitimate streams

Finding 4: Remove unused db timeout settings
- Remove db_lock_timeout_ms and db_statement_timeout_ms from config
- These settings were defined but never wired into DB operations
- Avoids false sense of protection

Finding 2 (notifications) is intentional: gateway-level notifications
are sent, and bulk UPDATE is used for performance under high load.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(perf): add nowait locks to prompt and resource state changes

Extends the lock contention fix to prompt_service and resource_service:
- Add PromptLockConflictError and ResourceLockConflictError classes
- Use nowait=True in get_for_update to fail fast if row is locked
- Add 409 Conflict handlers in main.py for both services
- Re-raise specific errors before generic Exception handler

This ensures consistent lock handling across all state change endpoints.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): improve rapid yield detection to catch all spin scenarios

- Track time since last yield as additional signal (<10ms is suspicious)
- Check rapid yield after BOTH message and keepalive yields
- Reset timestamps only after successful keepalive wait
- Include time interval in error log for debugging

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add robust spin loop detection and update dependencies (IBM#2355)

SSE transport improvements:
- Add consecutive rapid yield counter for simpler spin loop detection
  (triggers after 10 yields < 100ms apart)
- Remove deque clearing after keepalives that prevented detection
- Add client_close_handler_callable to detect disconnects that ASGI
  servers like granian may not propagate via request.is_disconnected()

Test updates:
- Update row-level locking tests to expect nowait=True for prompt
  and resource state changes

Dependency updates:
- Containerfile.lite: Update UBI base images to latest
- gunicorn 23.0.0 -> 24.1.1
- sqlalchemy 2.0.45 -> 2.0.46
- langgraph 1.0.6 -> 1.0.7
- hypothesis 6.150.2 -> 6.150.3
- schemathesis 4.9.2 -> 4.9.4
- copier 9.11.1 -> 9.11.3
- pytest-html 4.1.1 -> 4.2.0

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add granian worker lifecycle options for SSE connection leak workaround

Document GRANIAN_WORKERS_LIFETIME and GRANIAN_WORKERS_MAX_RSS options
as commented-out configuration in docker-compose.yml and run-granian.sh.

These options provide a workaround for granian issue IBM#286 where SSE
connections are not properly closed after client disconnect, causing
CPU spin loops after load tests complete.

Refs: IBM#2357, IBM#2358
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docker-compose updates for GUNICORN

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Update pyproject.toml

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* lint

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2363)

The Export Config button was not included when pagination was
implemented in PR IBM#1955. This button allows users to export
MCP client configuration in stdio/SSE/HTTP formats.

Closes IBM#2362

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: add sso_entra_admin_groups to list field validator

- sso_entra_admin_groups now properly parses CSV/JSON from environment
- Closes IBM#2265

Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>

* style: fix missing blank lines in test_config.py

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Akshay Shinde <akshayshinde@dhcp-9-162-244-59.mul.ie.ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…scanners (IBM#2200)

- Convert cache.py to use async redis (redis.asyncio) for non-blocking I/O
- Add parallel scanner execution using asyncio.gather in input/output filters
- Add asyncio.to_thread for CPU-bound scanner operations
- Quiet llm_guard logger to ERROR level to reduce noise
- Fix tests to use prompt_id instead of deprecated name parameter
- Update test to use environment variables for redis host/port

Security: Scanner errors now fail-closed (is_valid=False) instead of being
skipped, ensuring policy evaluation denies requests when scanners fail.

Closes IBM#1959

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
1. Replace custom CSS classes with native tailwind utility classes.
2. Add Chart.js theming for dark-mode graphs

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* fix: plugin template test cases.

Signed-off-by: Teryl Taylor <terylt@ibm.com>

* fix: context passing in unit test

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>

---------

Signed-off-by: Teryl Taylor <terylt@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Co-authored-by: Teryl Taylor <terylt@ibm.com>
Co-authored-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Precompile all regex patterns at module or configuration initialization
time across 14 plugins, eliminating per-request compilation overhead.

Closes IBM#1834

Signed-off-by: Shoumi <shoumimukherjee@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
* update jwt cli with more inputs

Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>

* fix: prevent non-expiring tokens from invalid expires_in_days

- Add ge=1 validation to TokenCreateRequest.expires_in_days schema
- Add guard in _generate_token to reject expires_at in the past
- Use math.ceil() and max(1, ...) to ensure exp is always set for
  sub-minute expirations (prevents rounding to 0)
- Mark --secret and --algo CLI args as deprecated (always uses config)
- Add tests for past expiry rejection and ceiling behavior

This fixes a security regression where negative/zero expires_in_days
could create permanent tokens instead of expired ones.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: restore --secret and --algo CLI options

The --secret and --algo CLI parameters now work as optional overrides:
- When provided, they override the configuration values
- When not provided, JWT_SECRET_KEY and JWT_ALGORITHM from config are used

This preserves backward compatibility while still defaulting to
configuration-based signing for consistency.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: require --secret when --algo is specified

Prevent invalid token generation by requiring --secret when --algo
is provided. Using --algo alone would mix config-based keys with a
different algorithm, potentially producing tokens that fail validation.

Also fixes stale docstring that still referenced DEFAULT_SECRET/DEFAULT_ALGO
instead of the new empty-string defaults with config fallback.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
* Initial commit for filesystem server
    - added stdio simple server
    - implemented list_directory tool

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improved main runner
    - Added argument handler
    - improve tracing
    - implemented streamable-http

Signed-off-by: cafalchio <maolivei@tcd.ie>

* added tracing info for each call

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Implemented search files recursively
    - use glob patterns
    - Added search to tools
    - Handle errors
    - Improve descriptions

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added files for new functions
    - Updated cargo file for file search

Signed-off-by: cafalchio <maolivei@tcd.ie>

* implemented case insensitive search: Lowercase filenames and patterns

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Implemented read_file function and added to a server
    -  check for Max file size 1Mb
    - check if the path is a file

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added tracing  for read file, improved error description

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added get_file_info
    - get size, created, modified and permissions
    - Added to the server

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added Read multiple files
    - use read file for each content
    - read async
    - added to server tools

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added write_file toold func
    - Write to a tempfile uuid
    - rename file to actual name
    - remove tempfile

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added  create_directory  tool

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added create_directory to server and implemented placeholder for list_allowed_directories

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Changed release config to reduce bin size

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Implemented move file function.
    - fails if destination exists

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added move_file to the server

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added edit file to the server

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added edit_file
    - support dry_run
    - use similar to get diffs

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Adding sandbox for path ccheck

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Apply fix to reduce TOCTOU vulnerability
    - atomic write, no checks and write

Signed-off-by: cafalchio <maolivei@tcd.ie>

* improve read to reduce TOCTOU vulnerability

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Stopped to follow symlink on search (security)

Signed-off-by: cafalchio <maolivei@tcd.ie>

* removed unused import

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improved Sandbox for TOCTOU safety
    - initialize sandbox once
    - check if new folders are inside root
    - resolve path inside root

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added get_roots for server  list_allowed_directories

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Using sanbox resolve path before ger file info

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added sandox to write_file and create_directory
    - validade parent folder
    - clean tempfile after
   - check new folder inside root

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added sandbox to write file
    - canonicalize and check new folders
    -  check parent folders
    - on create directory, check if exists

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added sandbox check  for list_directory

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added sandbox check for edit_file and move_file
   -  check and canonicalize destination parent

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Apply sandbox checks for read_file and read_multiple_files

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Changed sandbox initialization from global to context
    - removed global sandbox
    - initialize sandbox in main and pass to each function

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added tests for searc_files and  list_directories
    - test for symlinks
    - test for path outside roots
    - > 95% coverage
    - formatted using fmt

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Formatted files using cargo fmt

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added tests for get_file_info

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added test coverage for read_file and read_multiple_files

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added unit test coverage for write_file and create_directory

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improve test coverage for edit_file and move_file

Signed-off-by: cafalchio <maolivei@tcd.ie>

* format edit.rs

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added test coverage for sandbox

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improve server runner
    - addded sever gracefully shutdown
    -  Declare Config values in main
    -  Improve server logs

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improved logs for list_allowed_directories and linted file

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improve test coverage for server
    - iimproved logs in server

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Linted using cargo clippy

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Simplified main.rs and moved server code to lib.rs for integration tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added integration tests to simulate workflows
    - file and folder manipulation workflow
    - permission and metadata workflow
    - search and organise workflow
    - server tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* formatted and clippped integration tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* added result where it was missing in tool call result
    - rename all outputs to result

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Normalized write file and create directory output

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Structured search and write outputs
    - update tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Fixed write result not showing errors correctly
    - server will return WriteResult
    - create_directory return err or string
    - fixed test create_directory tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Normalized output result for write_file

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Normalized tool result outpus
    - Keep consistency between tools
    - Return MPC error or success
    - reorganised tools between files
    - updated tests

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added server start banner

Signed-off-by: cafalchio <maolivei@tcd.ie>

* fixed main receiving multiple roots

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Clipped and formatted

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Removed justfile and added Makefile

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added correct readme

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Added dockerfile

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Improved tracing logs

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Update test coverage and binary size  in readme

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Removed umused dependency

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Updated cargo dependencies

Signed-off-by: cafalchio <maolivei@tcd.ie>

* Updted deprecated InitializedRequestParam

Signed-off-by: cafalchio <maolivei@tcd.ie>

* fix: correct error messages and documentation in filesystem server

- Fix misleading error messages in server.rs that said "Error writing file"
  when the actual operation was read, move, or get_file_info
- Update README to accurately reflect that move_file overwrites destination
  (previously incorrectly stated it fails)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: cafalchio <maolivei@tcd.ie>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…ts (IBM#2345)

* Fix proxy authentication

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>

* Fix pylint errors

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>

* fix: Correct lint issues in proxy auth tests

- Add missing blank line between test classes
- Remove unused jwt import
- Fix excess blank lines

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Include plugin context in proxy auth for cross-hook sharing

Add plugin_context_table and plugin_global_context to proxy
authentication paths, matching the JWT authentication path.
This ensures HTTP_AUTH_CHECK_PERMISSION hooks can access context
set by HTTP_PRE_REQUEST hooks when using proxy authentication.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Address security concerns in proxy authentication

1. RBAC now checks auth_required when proxy header missing
   - Returns 401 for API requests, 302 redirect for browsers
   - Aligns HTTP behavior with WebSocket auth

2. Block anonymous users from token management
   - Add auth_method=="anonymous" to _require_interactive_session
   - Prevents token access when proxy header missing

3. Lookup proxy user admin status from database
   - Check platform_admin_email for admin match
   - Query EmailUser table for is_admin status
   - Enables plugin permission hooks to work correctly

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Align require_auth with RBAC proxy enforcement

Update require_auth to check auth_required when proxy header is
missing, matching the RBAC/WebSocket behavior. Previously returned
"anonymous" even when auth_required=true.

- Raise 401 when mcp_client_auth_enabled=false and no proxy header
  if auth_required=true
- Update tests to cover both auth_required=true and false cases

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mohan Lakshmaiah <mohalaks@in.ibm.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Madhav Kandukuri <madhav165@gmail.com>
crivetimihai and others added 7 commits February 15, 2026 12:23
…2951)

* feat(auth): implement password reset and recovery workflows

Add self-service forgot/reset password APIs and admin UI flows with one-time reset tokens, SMTP notifications, account unlock actions, lockout expiry fixes, metrics, migration, docs, and recovery tooling.

Closes IBM#2542

Closes IBM#2543

Closes IBM#2628

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* lint fixes

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Test coverage

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Fix password reset issues

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Rebase

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: mark hash_password utility executable

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
IBM#2950)

* fix(ui): standardize loading indicators across admin pages (IBM#2946)

Signed-off-by: Oriol Morros Vilaseca <OM368@student.aru.ac.uk>

* fix(ui): standardize Users loading indicator to match pattern

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Oriol Morros Vilaseca <OM368@student.aru.ac.uk>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Co-authored-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2865)

* feat: add slow-time-server for timeout, resilience, and load testing

Implements IBM#2783. A configurable-latency Go MCP server modelled on
fast-time-server that introduces artificial delays on every tool call,
serving as a testing target for gateway timeout enforcement, circuit
breaker behaviour, session pool resilience, and load testing.

Server features:
- 5 MCP tools: get_slow_time, convert_slow_time, get_instant_time,
  get_timeout_time, get_flaky_time
- 2 MCP resources: latency://config, latency://stats
- 1 MCP prompt: test_timeout
- 4 latency distributions: fixed, uniform, normal, exponential
- Failure simulation with configurable rate and mode
- Runtime reconfiguration via REST POST /api/v1/config
- Invocation statistics with p50/p95/p99 percentiles
- Multi-transport: stdio, SSE, Streamable HTTP, dual, REST
- 32 unit tests with race detection, all passing

Integration:
- docker-compose.yml: testing profile (port 8889) + auto-registration
- docker-compose-performance.yml: dedicated performance testing service
- Locust load test with 4 scenarios (slow, timeout storm, mixed, circuit breaker)
- Documentation in docs/docs/using/servers/go/slow-time-server.md

Closes IBM#2783

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: address review issues in slow-time-server PR

- Fix healthcheck in docker-compose-performance.yml to use binary's
  built-in health check instead of curl (scratch image has no curl)
- Replace mixed atomic+mutex with plain increments under mutex in
  invocationStats.record() for clarity
- Remove dead generateTestTimeoutPrompt() function and unused strings
  import from rest_handlers.go
- Remove unused json and uuid imports from locust test file
- Align env var naming (SLOW_TIME_LATENCY) between compose files

Closes IBM#2783

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: move slow-time-server to dedicated resilience profile

The slow-time-server deliberately introduces latency and failures,
which breaks existing tests when included in the testing profile.
Move it to a dedicated 'resilience' profile so it must be explicitly
opted into.

- docker-compose.yml: profiles ["testing"] -> ["resilience"]
- Makefile: add resilience-up/down/logs targets
- docs: update profile references

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat: add Makefile targets for resilience load testing

Add dedicated targets for running Locust and JMeter tests against
the slow-time-server, ensuring these tests only run when explicitly
invoked rather than as part of the regular testing profile.

New targets:
- resilience-locust: headless Locust run (10 users, 120s)
- resilience-locust-ui: Locust web UI on port 8090
- resilience-jmeter: JMeter baseline (20 threads, 5min)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…IBM#2943)

* fix: add alembic upgrade validation and migration compatibility fixes

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(db): handle psycopg JSON deserialization in alembic migrations

Closes IBM#2955

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* format

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ce suite (IBM#2956)

* feat: implement MCP 2025-11-25 compliance suite and dated make targets

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* format

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: exclude compliance tests from default pytest and resolve gosec findings

- Add --ignore=tests/compliance to pytest addopts so compliance suite
  only runs via dedicated make targets (make 2025-11-25, etc.)
- Lazy-import mcpgateway.main in compliance conftest to avoid triggering
  bootstrap_db during test collection
- Fix gosec G114: replace bare http.ListenAndServe with http.Server
  using ReadHeaderTimeout in slow-time-server (4 instances)
- Suppress gosec G404/G705 false positives with nosec annotations

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…BM#2724)

- Add AlreadyEncryptedError and NotEncryptedError (extend ValueError)
  for explicit validation in strict mode
- Introduce v2: format prefix for unambiguous encrypted data detection
- Add strict vs idempotent API modes (decrypt_secret vs
  decrypt_secret_or_plaintext) with backward-compatible async wrappers
- Replace length heuristic in oauth_manager with explicit is_encrypted()
- Add null checks after decryption in dcr_service update/delete
- Migrate encryption tests to dedicated test_encryption_service.py
- Add comprehensive test coverage for edge cases, concurrent operations,
  and real-world token formats (JWT, OAuth2, API keys)

Closes IBM#2405

Signed-off-by: Mohan Lakshmaiah <mohan.economist@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@jonpspri
Copy link
Copy Markdown
Collaborator

jonpspri commented Feb 15, 2026

Hi @kcostell06. A couple things we can do to get this stabilized and merged:

  1. git config user.email "youremail@example.com" (but with the same e-mail you used to create your github account) for code signature
  2. git rebase --signoff origin/main - will reorder your repo and bring your work to the top git log while also re-signing those commits with your GitHub e-mail address.

You should also address the items in the code review above. Feel free to reach out if you require assistance on any of those.

Kelly Costello and others added 2 commits February 15, 2026 22:09
Signed-off-by: Kelly Costello <kellycostello@Kellys-Air.localdomain>
Integrate policy decision audit logging into the gateway, replacing the
standalone mcp_servers/ implementation. Fixes all six reviewer issues:
SQL injection (allowlist validation), blocking async (sync sessions),
missing auth (require_admin_auth), return type mismatch (always returns
PolicyDecision), print→logging, and the integration gap.

Adds SIEM export service (Splunk HEC, Elasticsearch, webhook) with
batch processing, retry/backoff, and lifecycle management in main.py.
Wires policy logging into RBAC enforcement middleware.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kelly Costello <kellycostello@Kellys-MacBook-Air.local>
@jonpspri jonpspri force-pushed the feature/audit-logging-2225 branch from af32330 to 4915856 Compare February 15, 2026 22:19
@jonpspri
Copy link
Copy Markdown
Collaborator

Code Review: PR #2707 - Policy Audit Trail and Decision Logging (#2225)

Author: kcostell06 | Additions: 4,942 | Deletions: 0 | CI: DCO passing

Overview

This PR implements policy decision audit logging with SIEM integration (Splunk, Elasticsearch, Webhook), a REST API for querying decisions, and integration into the RBAC middleware. The feature is behind a policy_audit_enabled flag (default: false).

The PR has two parallel implementations: a standalone MCP server in mcp-servers/audit/ and an integrated version in mcpgateway/. The integrated version is what matters for the gateway.


Critical Issues

1. SQL Injection in Standalone mcp_audit_database.py

mcp-servers/audit/mcp_audit_database.py lines ~219-224:

query = f"""
    SELECT * FROM audit_decisions
    WHERE {where_clause}
    ORDER BY {filter.sort_by} {filter.sort_order}
    LIMIT ? OFFSET ?
"""

filter.sort_by and filter.sort_order are interpolated directly into SQL via f-string with no validation or allowlisting. An attacker who controls these values could inject arbitrary SQL. The gateway-integrated policy_decision_service.py handles this correctly with an allowlist, but this standalone code is still shipped.

2. Deprecated asyncio.get_event_loop() Usage

mcpgateway/services/policy_decision_service.py lines ~148-154:

loop = asyncio.get_event_loop()
if loop.is_running():
    loop.create_task(self._siem_processor.add(record))
else:
    asyncio.run(self._siem_processor.add(record))

asyncio.get_event_loop() is deprecated since Python 3.10 and emits a DeprecationWarning in 3.12+. Use asyncio.get_running_loop() in a try/except instead. Also, calling asyncio.run() from within a sync function that might be called from an async context is unsafe.

3. Plugin-Auth Decisions Not Logged (P1)

mcpgateway/middleware/rbac.py lines 599-608:

When an HTTP_AUTH_CHECK_PERMISSION plugin returns a decision, the wrapper returns or raises immediately, so the new policy-audit logging block (line 641) is never reached. In deployments that rely on plugin-based authorization, both allow and deny outcomes are silently missing from policy_decisions, which defeats the audit trail for a major decision path.


Significant Issues

4. Per-Decision Audit Toggles Not Enforced (P2)

mcpgateway/services/policy_decision_service.py lines 88-94 and mcpgateway/config.py:

settings.policy_audit_log_allowed and settings.policy_audit_log_denied are defined in config but never consulted — log_decision persists every decision whenever policy_audit_enabled is true. Users who disable one decision type (for volume/compliance reasons) will still have those records written.

5. Health Endpoint Requires Admin Auth (P3)

mcpgateway/routers/policy_decisions_api.py lines 28-32:

The router applies Depends(require_admin_auth) globally, so /api/policy-decisions/health also requires admin auth despite its docstring saying it is for unauthenticated monitoring. External health probes without credentials will get 401.

6. Massive Code Duplication

The mcp-servers/audit/ directory contains ~1,800 lines that duplicate what's already in the gateway-integrated modules:

  • mcp_audit_models.py duplicates mcpgateway/common/policy_audit.py
  • mcp_audit_database.py duplicates the SQLAlchemy-based service
  • mcp_audit_siem.py duplicates mcpgateway/services/siem_export_service.py
  • mcp_audit_api.py duplicates mcpgateway/routers/policy_decisions_api.py
  • mcp_audit_service.py duplicates mcpgateway/services/policy_decision_service.py

The standalone version is lower quality (uses print(), raw SQL with f-strings, no auth).

7. Documentation Contains AI Artifacts

  • docs/audit/TEST_SUMMARY.md line 148 references /mnt/user-data/outputs/ which is a path from an AI coding environment
  • All four docs files are heavily duplicative of each other — the same test results table appears 4 times
  • Marketing-style emoji headers and self-congratulatory tone don't match the project's documentation style

8. Import-in-Hot-Path in RBAC Middleware

mcpgateway/middleware/rbac.py line 645:

from mcpgateway.services.policy_decision_service import policy_decision_service

This import runs on every RBAC permission check when policy_audit_enabled=True. While Python caches imports, the overhead of the import machinery on a high-frequency middleware path is unnecessary. Move the import to module level.


Minor Issues

9. Timezone-Naive datetime.now()

  • mcpgateway/services/siem_export_service.py line 249: datetime.now().isoformat() (no timezone)
  • mcp-servers/audit/mcp_audit_service.py: datetime.now() (no timezone)
  • The gateway service properly uses datetime.now(timezone.utc) but these are inconsistent.

10. Migration File Naming

add_policy_decisions_migration.py with revision ID add_policy_decisions doesn't follow the project's hex-hash convention for Alembic migrations. Should be auto-generated via alembic revision --autogenerate. The down_revision = "b1b2b3b4b5b6" is valid (confirmed: it's from PR #2507 on main).

11. Bare except Exception with Debug Logging

mcpgateway/middleware/rbac.py line 660:

except Exception as audit_err:
    logger.debug(f"Policy decision logging failed: {audit_err}")

Using logger.debug for a failed audit write means this will silently vanish in production. For security audit logging, this should be at least logger.warning.


What's Good

  • The gateway-integrated code (mcpgateway/services/, mcpgateway/routers/) is well-structured with proper SQLAlchemy ORM, parameterized queries, and sort-column allowlisting
  • Feature is gated behind policy_audit_enabled=False by default, so it won't affect existing deployments
  • Tests for the integrated services are reasonable — they mock the DB layer properly and test auth requirements
  • The SIEM batch processor design with configurable flush intervals and retry logic is solid
  • Admin auth is properly applied to all query endpoints

Recommendation

Request changes. The functional gaps (plugin decisions not logged, config toggles ignored, health endpoint locked) need fixing before merge. The standalone code's SQL injection and the deprecated asyncio usage are also blockers.

@jonpspri
Copy link
Copy Markdown
Collaborator

Fix PR #2707: Policy Audit Trail Issues

Context

PR #2707 implements policy decision audit logging (#2225) for MCP Gateway. The integrated gateway code (mcpgateway/services/, mcpgateway/routers/) is structurally sound, but has functional gaps (plugin decisions not logged, config toggles unused, health endpoint locked behind auth) and code quality issues (deprecated asyncio API, timezone bugs, standalone duplicate code with SQL injection). This plan addresses all review findings plus three priority issues (P1-P3) identified in follow-up review.


1. [P1] Log plugin-auth decisions before early return

File: mcpgateway/middleware/rbac.py lines 599-608

Problem: When plugin HTTP_AUTH_CHECK_PERMISSION returns a decision, the decorator returns/raises immediately at line 603/605, skipping the audit logging block at line 641. Plugin-based allow/deny decisions are never logged.

Fix: Add audit logging inside the plugin decision block (lines 600-608), before the return and raise statements. Extract the audit call into a helper to avoid duplicating the logging logic between the plugin path and the RBAC path.

# Helper function (define inside require_permission or at module scope)
def _log_policy_decision(user_context, permission, resource_type, decision_str, engines):
    if settings.policy_audit_enabled:
        try:
            from mcpgateway.services.policy_decision_service import policy_decision_service
            policy_decision_service.log_decision(
                action=permission,
                decision=decision_str,
                subject_id=user_context.get("email"),
                subject_email=user_context.get("email"),
                resource_type=resource_type,
                ip_address=user_context.get("ip_address"),
                user_agent=user_context.get("user_agent"),
                request_id=user_context.get("request_id"),
                policy_engines_used=engines,
                severity="info" if decision_str == "allow" else "warning",
            )
        except Exception as audit_err:
            logger.warning(f"Policy decision logging failed: {audit_err}")

Use this helper in both the plugin path (line ~600, with engines=["plugin"]) and the existing RBAC path (line ~641, with engines=["rbac"]), replacing the inline block.


2. [P2] Honor per-decision audit toggles

File: mcpgateway/services/policy_decision_service.py lines 88-94

Problem: settings.policy_audit_log_allowed and settings.policy_audit_log_denied (defined in config.py) are never checked. All decisions are logged when policy_audit_enabled=True.

Fix: Add toggle checks after the policy_audit_enabled guard:

if not settings.policy_audit_enabled:
    ...return early...

# Check per-decision toggles
if decision == "allow" and not settings.policy_audit_log_allowed:
    return PolicyDecision(action=action, decision=decision, timestamp=datetime.now(timezone.utc))
if decision == "deny" and not settings.policy_audit_log_denied:
    return PolicyDecision(action=action, decision=decision, timestamp=datetime.now(timezone.utc))

3. [P3] Health endpoint without admin auth

File: mcpgateway/routers/policy_decisions_api.py lines 28-32, 182-193

Problem: Router-wide dependencies=[Depends(require_admin_auth)] forces auth on /health, contradicting its docstring ("no auth required for monitoring").

Fix: Create a separate router for the health endpoint without the auth dependency, and keep the main router for authenticated endpoints:

# Unauthenticated health router
health_router = APIRouter(prefix="/api/policy-decisions", tags=["Policy Decisions"])

@health_router.get("/health", ...)
def health_check(): ...

# Authenticated data router (existing, keeps dependencies=[...])
router = APIRouter(prefix="/api/policy-decisions", tags=["Policy Decisions"],
                   dependencies=[Depends(require_admin_auth)])

Then in mcpgateway/main.py, include both routers when policy_audit_enabled.


4. Fix deprecated asyncio.get_event_loop()

File: mcpgateway/services/policy_decision_service.py lines ~148-154

Problem: asyncio.get_event_loop() is deprecated in Python 3.12+.

Fix:

try:
    loop = asyncio.get_running_loop()
    loop.create_task(self._siem_processor.add(record))
except RuntimeError:
    # No running event loop — skip async SIEM queuing
    logger.debug("No running event loop; SIEM queuing deferred")

5. Fix timezone-naive datetime.now()

File: mcpgateway/services/siem_export_service.py line 249

Fix: Change datetime.now().isoformat() to datetime.now(timezone.utc).isoformat().

Also fix in mcp-servers/audit/mcp_audit_service.py line ~118 (datetime.now()datetime.now(timezone.utc)).


6. Upgrade audit failure log level

File: mcpgateway/middleware/rbac.py line 660

Fix: Change logger.debug(...) to logger.warning(...) — silent audit failures in production are a security concern. (This will be handled via the helper function from fix #1.)


7. Fix standalone mcp-servers/audit/ SQL injection

File: mcp-servers/audit/mcp_audit_database.py lines ~219-224

Problem: filter.sort_by and filter.sort_order interpolated via f-string into raw SQL.

Fix: Add allowlist validation before the query:

ALLOWED_SORT_COLUMNS = {"timestamp", "subject_id", "resource_id", "decision", "action"}
ALLOWED_SORT_ORDERS = {"asc", "desc"}
sort_by = filter.sort_by if filter.sort_by in ALLOWED_SORT_COLUMNS else "timestamp"
sort_order = filter.sort_order if filter.sort_order.lower() in ALLOWED_SORT_ORDERS else "desc"

8. Fix standalone mcp-servers/audit/ print() calls

File: mcp-servers/audit/mcp_audit_siem.py (multiple lines)

Fix: Replace all print(...) calls with logging.getLogger(__name__) usage. Add import logging and logger = logging.getLogger(__name__) at module level. Same for mcp_audit_database.py if any print calls exist.


9. Clean up docs/audit/ AI artifacts

Files: All 4 files in docs/audit/

Fixes:

  • TEST_SUMMARY.md line 148: Remove /mnt/user-data/outputs/ path reference — replace with relative repo paths
  • Remove duplicated test results (same table appears 4 times across files)
  • Remove marketing-style emoji headers; align with project documentation style
  • Remove self-referential language ("ALL PASSING", "100% core functionality")
  • Consolidate the 4 largely-redundant docs into fewer files where possible

10. Regenerate Alembic migration with proper ID

File: mcpgateway/alembic/versions/add_policy_decisions_migration.py

Fix: The revision ID add_policy_decisions doesn't follow the hex-hash convention. Regenerate with alembic revision --autogenerate -m "add_policy_decisions_table" to get a proper hash ID, then port the idempotent upgrade/downgrade logic from the existing file. Keep down_revision = "b1b2b3b4b5b6" (confirmed valid — it's from PR #2507 on main).


11. Update tests

Files:

  • tests/unit/mcpgateway/routers/test_policy_decisions_api.py
  • tests/unit/mcpgateway/services/test_policy_decision_service.py

Changes:

  • Health endpoint test: Update to verify health endpoint works without auth (new health_router)
  • Per-decision toggle test: Add test that log_decision() skips when policy_audit_log_allowed=False for allow decisions and policy_audit_log_denied=False for deny decisions
  • Plugin audit logging test: Add test verifying that plugin decisions trigger policy_decision_service.log_decision() with policy_engines_used=["plugin"]
  • Existing health auth test: Update test_health_check_no_auth to test against the separated health router

12. Include health router in main.py

File: mcpgateway/main.py lines ~7208-7216

Fix: Import and include both routers:

if settings.policy_audit_enabled:
    from mcpgateway.routers.policy_decisions_api import router as policy_decisions_router
    from mcpgateway.routers.policy_decisions_api import health_router as policy_health_router
    app.include_router(policy_decisions_router)
    app.include_router(policy_health_router)

Files to modify

File Changes
mcpgateway/middleware/rbac.py Extract audit helper, add plugin-decision logging, upgrade log level
mcpgateway/services/policy_decision_service.py Add per-decision toggles, fix asyncio deprecation
mcpgateway/services/siem_export_service.py Fix timezone-naive datetime
mcpgateway/routers/policy_decisions_api.py Separate health endpoint into unauth router
mcpgateway/main.py Include health router
mcpgateway/alembic/versions/add_policy_decisions_migration.py Regenerate with proper revision ID
mcp-servers/audit/mcp_audit_database.py Fix SQL injection with sort allowlist
mcp-servers/audit/mcp_audit_siem.py Replace print() with logging
mcp-servers/audit/mcp_audit_service.py Fix timezone-naive datetime
docs/audit/MCP_AUDIT_IMPLEMENTATION.md Clean AI artifacts
docs/audit/QUICKSTART_TESTING.md Clean AI artifacts
docs/audit/TESTING_GUIDE.md Clean AI artifacts
docs/audit/TEST_SUMMARY.md Clean AI artifacts, fix /mnt/user-data/outputs/ path
tests/unit/mcpgateway/routers/test_policy_decisions_api.py Update health test, add plugin audit test
tests/unit/mcpgateway/services/test_policy_decision_service.py Add per-decision toggle tests

Verification

  1. Unit tests: pytest tests/unit/mcpgateway/services/test_policy_decision_service.py tests/unit/mcpgateway/routers/test_policy_decisions_api.py tests/unit/mcpgateway/services/test_siem_export_service.py -v
  2. Lint: make autoflake isort black pre-commit then make flake8 bandit interrogate pylint verify
  3. Migration check: cd mcpgateway && alembic heads — verify the policy decisions migration appears with a hex ID
  4. Manual smoke test: Start with POLICY_AUDIT_ENABLED=true make dev, verify /api/policy-decisions/health returns 200 without auth, verify /api/policy-decisions/decisions returns 401 without auth

…2225)

- Fix Alembic migration: rename to hex revision, merge two heads
- Remove duplicate mcp-servers/audit/ standalone implementation
- Add sort_by/sort_order pattern validation on GET endpoint
- Remove unnecessary db.commit() from read-only queries
- Separate health endpoint into unauthenticated router
- Add plugin-auth decision logging in rbac.py
- Honor per-decision audit toggles (log_allowed/log_denied)
- Fix deprecated asyncio.get_event_loop() -> get_running_loop()
- Fix timezone-naive datetime.now() in siem_export_service
- Upgrade audit failure log level to warning
- Clean up docs/audit/ artifacts
- Include health_router in main.py
- Add per-decision toggle and health endpoint tests

Closes IBM#2225

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kelly Costello <kellycostello@Kellys-MacBook-Air.local>
@kcostell06
Copy link
Copy Markdown
Author

Hi — I've pushed changes to address all the review feedback from @crivetimihai and @jonpspri. Here's a summary of what was done:

Alembic Migration

  • Renamed migration to use proper hex revision (373c8cc5e13a)
  • Fixed down_revision to merge both existing heads (b1b2b3b4b5b6, w2b3c4d5e6f7), resolving the multiple-heads issue

Duplicate Code Removal

  • Deleted the entire standalone mcp-servers/audit/ directory (6 files, ~1,850 lines) — the real implementation lives in mcpgateway/ and this duplicate had SQL injection and other issues

API Fixes

  • Added pattern= regex validation to sort_by and sort_order on the GET /decisions endpoint (matching the POST model)
  • Removed unnecessary db.commit() from read-only query_decisions() and get_statistics() methods
  • Separated the /health endpoint into its own unauthenticated health_router, and included it in main.py

RBAC / Audit Logging

  • Added _log_policy_decision() helper in rbac.py to log both plugin-auth and RBAC decisions
  • Plugin decisions are now logged before the early return/raise
  • Upgraded audit failure log level from debug to warning

Per-Decision Toggles

  • log_decision() now honors policy_audit_log_allowed and policy_audit_log_denied settings, skipping DB writes when toggled off

Async / Timezone Fixes

  • Replaced deprecated asyncio.get_event_loop() with asyncio.get_running_loop() in policy_decision_service.py
  • Fixed timezone-naive datetime.now()datetime.now(timezone.utc) in siem_export_service.py

Documentation Cleanup

  • Rewrote docs/audit/MCP_AUDIT_IMPLEMENTATION.md — removed emojis, self-referential language, and updated file references
  • Deleted redundant QUICKSTART_TESTING.md and TEST_SUMMARY.md

Tests

  • Updated test_policy_decisions_api.py: health endpoint test now verifies it works without any auth dependency
  • Added 3 new tests in test_policy_decision_service.py for per-decision toggle behavior
  • All 15 audit-related tests pass

@jonpspri jonpspri requested a review from Lang-Akshay February 16, 2026 12:25
- Rewrite PolicyDecision ORM model to Mapped[]/mapped_column() style
  matching AuditTrail convention in db.py
- Remove duplicate index=True from columns covered by composite indexes
- Add missing named indexes to migration, make downgrade idempotent
- Extract lazy-cached singleton for policy_decision_service in rbac.py
- Convert f-string logging to %s style in service and router modules
- Add bounded queue with drop-oldest to SIEMBatchProcessor
- Add shutdown no-progress guard to prevent _flush_all hanging
- Add tests for flush_all, shutdown hang prevention, and queue bounds
- Register PolicyDecision on Base.metadata for create_all bootstrap
- Trim TESTING_GUIDE.md to reference actual test paths
- Update copyright years to 2026

Signed-off-by: Jonathan Springer <jps@s390x.com>
Copy link
Copy Markdown
Collaborator

@Lang-Akshay Lang-Akshay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kcostell06

Thanks allot for the PR, really appreciate it.
Can you run following tests locally and ensure they are passing.

  • Format (If this fails create a commit -> fixup:formatting)

    make autoflake isort black pre-commit

  • Unit Test

    make doctest test lint-web flake8 bandit interrogate pylint verify

  • Check coverage (ensure at-least 99% coverage)

    make coverage

  • Test with migration (Make sure you are running docker daemon)

    make docker-prod testing-down testing-up

Commits the changes after all the tests.

Thanks again.

@crivetimihai
Copy link
Copy Markdown
Member

Thanks for the thorough rework, @kcostell06. 4 of 5 original issues are fixed:

  • SQL injection via f-string sort_by — fixed (allowlist + regex validation, good defense-in-depth)
  • Standalone mcp-servers/audit/ duplication — fixed (removed)
  • GET /decisions missing sort_by validation — fixed (regex pattern constraint)
  • Read-only query_decisions calls db.commit() — fixed (only db.close())

Still broken: Alembic down_revision points to wrong parent (blocking)

down_revision: Union[str, Sequence[str], None] = "w6g7h8i9j0k1"

The current head on main is x7h8i9j0k1l2 (the jwks_uri migration). The PR points to w6g7h8i9j0k1 (password reset tokens), which is the parent of the current head. This will create multiple alembic heads and break migrations.

Fix: down_revision = "x7h8i9j0k1l2"

Verify with: cd mcpgateway && alembic heads (should show single head after fix).

@crivetimihai crivetimihai added COULD P3: Nice-to-have features with minimal impact if left out; included if time permits enhancement New feature or request labels Feb 21, 2026
Kelly Costello and others added 2 commits February 22, 2026 16:06
…2225)

- Add SIEM export service tests (39 tests covering all exporters and batch processor)
- Add policy_audit ORM model serialization tests
- Add policy_decision_service tests for SIEM queuing, query filters, and statistics
- Add API error handler tests for 500 responses
- Fix import ordering (isort) and pre-commit formatting
- Merge third Alembic head (w6g7h8i9j0k1) to resolve multiple-heads error
- Exclude untracked ip_control files from coverage

Coverage: 99% (53,217 stmts, 458 missed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kelly Costello <kellycostello@Kellys-Air.localdomain>
…1l2 (IBM#2225)

The migration's down_revision was pointing to w6g7h8i9j0k1 (parent of
current head) instead of x7h8i9j0k1l2 (the jwks_uri migration that is
the actual current head on main). This would have created multiple
Alembic heads and broken migrations.

Also adds the x7h8i9j0k1l2 migration file from upstream main so the
revision chain is complete on this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kelly Costello <kellycostello@Kellys-Air.localdomain>
@kcostell06
Copy link
Copy Markdown
Author

Alembic Migration Fix (ee8e423)

Addressed @crivetimihai's feedback on the broken migration chain.

Problem: The down_revision in 373c8cc5e13a_add_policy_decisions_migration.py was pointing to w6g7h8i9j0k1 (password reset tokens), which is the parent of the current head on main. This would have created multiple Alembic heads and broken migrations.

Fix:

  • Updated down_revision from the old tuple ("b1b2b3b4b5b6", "w2b3c4d5e6f7", "w6g7h8i9j0k1") to "x7h8i9j0k1l2" — the actual current head on main (the jwks_uri migration)
  • Added x7h8i9j0k1l2_add_jwks_uri_to_sso_providers.py to the branch so the revision chain is complete
  • Removed stale untracked migration file (w2b3c4d5e6f7) that was creating a phantom second head

Verification:

$ cd mcpgateway && alembic heads
373c8cc5e13a (head)

Single head confirmed. The migration chain is now: x7h8i9j0k1l2373c8cc5e13a.

@crivetimihai crivetimihai changed the title feat: implement policy audit trail and decision logging (#2225) feat(security): implement policy audit trail and decision logging Feb 23, 2026
crivetimihai pushed a commit that referenced this pull request Feb 24, 2026
- Fix Alembic migration: rename to hex revision, merge two heads
- Remove duplicate mcp-servers/audit/ standalone implementation
- Add sort_by/sort_order pattern validation on GET endpoint
- Remove unnecessary db.commit() from read-only queries
- Separate health endpoint into unauthenticated router
- Add plugin-auth decision logging in rbac.py
- Honor per-decision audit toggles (log_allowed/log_denied)
- Fix deprecated asyncio.get_event_loop() -> get_running_loop()
- Fix timezone-naive datetime.now() in siem_export_service
- Upgrade audit failure log level to warning
- Clean up docs/audit/ artifacts
- Include health_router in main.py
- Add per-decision toggle and health endpoint tests

Closes #2225

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kelly Costello <kellycostello@Kellys-MacBook-Air.local>
@crivetimihai
Copy link
Copy Markdown
Member

Reopened as #3191. CI/CD will re-run on the new PR. You are still credited as the author.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

COULD P3: Nice-to-have features with minimal impact if left out; included if time permits enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE][POLICY]: Policy audit trail and decision logging