Skip to content

fix(db): Guard against inactive transaction during async cleanup#2351

Merged
crivetimihai merged 6 commits intomainfrom
2341-mcp-issue
Jan 23, 2026
Merged

fix(db): Guard against inactive transaction during async cleanup#2351
crivetimihai merged 6 commits intomainfrom
2341-mcp-issue

Conversation

@crivetimihai
Copy link
Copy Markdown
Member

@crivetimihai crivetimihai commented Jan 23, 2026

Summary

This PR fixes two related issues affecting gateway registration with Moody's MCP server:

  1. [BUG]: MCP CF crashes while listing tools from moody's mcp server #2341 - Transaction Inactive Crash: Gateway crashes with InvalidRequestError: This transaction is inactive after successfully fetching tools/resources
  2. [BUG]: Multiple gateway import failing with unique constraint error for resources #2352 - Unique Constraint Violation: Re-adding a gateway fails with UniqueViolation on uq_team_owner_uri_resource

Root Cause Analysis

Issue #2341: Transaction Inactive During Async Cleanup

Timeline from logs:

Timestamp Event
09:42:54.938Z Fetched 10 tools from gateway
09:42:55.506Z Fetched 53 resources from gateway
09:42:56.024Z Fetched 1 resource templates from gateway
09:42:56.378Z DELETE .../mcp "HTTP/1.1 204 No Content"
09:42:56.416Z InvalidRequestError('This transaction is inactive')

Root Cause: CancelledError during MCP session cleanup (DELETE request) causes the transaction to become inactive. Then get_db() tries to commit() on the inactive transaction.

Fix: Add db.is_active checks before commit() and rollback() in get_db().

Issue #2352: Unique Constraint Violation on Re-Registration

The Constraint:

# Resources unique on (team_id, owner_email, uri) - NOT including gateway_id
UniqueConstraint("team_id", "owner_email", "uri", name="uq_team_owner_uri_resource")

Failure Sequence:

  1. Gateway A registered → resources created with (team_id, owner_email, uri)
  2. Gateway A deleted → but [BUG]: MCP CF crashes while listing tools from moody's mcp server #2341 crash leaves orphaned resources
  3. Gateway B registered (same MCP server) → tries to INSERT same resources → CONFLICT

Fix: Add upsert logic for orphaned resources and prompts only:

  • Query for existing records where gateway_id IS NULL or points to non-existent gateway
  • Update orphaned records instead of creating duplicates
  • Resources belonging to active gateways are NOT touched

Changes

mcpgateway/main.py

  • Added db.is_active guard before commit() in get_db()
  • Added db.is_active guard before rollback() in exception handler
  • Added db.invalidate() fallback for broken connections

mcpgateway/services/gateway_service.py

  • Added upsert logic for orphaned resources only (query by team_id, owner_email, uri)
  • Added upsert logic for orphaned prompts only (query by team_id, owner_email, name)
  • Uses per-resource team/owner overrides when building lookup key
  • Only updates records where gateway_id IS NULL or gateway doesn't exist

scripts/cleanup_orphaned_resources.py (NEW)

  • Cleanup script to remove orphaned records from affected databases

tests/unit/mcpgateway/services/test_gateway_resources_prompts.py

  • Added test_register_gateway_updates_orphaned_resources
  • Added test_register_gateway_does_not_update_active_gateway_resources

Manual Cleanup (For Affected Users)

If you experienced the #2341 crash, you may have orphaned resources/prompts/tools in your database. A cleanup script is provided:

# Dry run (default) - shows what would be deleted
python scripts/cleanup_orphaned_resources.py

# Actually delete orphaned records
python scripts/cleanup_orphaned_resources.py --execute

# Filter by team or owner
python scripts/cleanup_orphaned_resources.py --team-id <team_id>
python scripts/cleanup_orphaned_resources.py --owner-email <email>

Note: The upsert fix will handle orphaned records automatically on re-registration (they get reassigned to the new gateway). The cleanup script is only needed if you want to proactively remove orphaned records without re-registering.


Code Review Findings (Addressed)

Finding Severity Status
Upsert updates any existing resource without verifying it's orphaned High ✅ Fixed - now only updates records where gateway_id IS NULL or gateway doesn't exist
Resource lookup uses gateway-level keys but inserts use per-resource overrides Medium ✅ Fixed - lookup now uses per-resource team_id and owner_email overrides
No tests for upsert logic Low ✅ Fixed - added 2 tests for orphaned resource handling

Test Plan

  • Run gateway service tests: 75 passed
  • Run gateway resources/prompts tests: 6 passed
  • Run all service tests: 1795 passed, 34 skipped
  • Integration test with Moody's MCP server

Closes #2341
Closes #2352

When registering MCP servers with long initialization times (like Moody's),
a CancelledError can occur during the MCP session teardown (DELETE request).
This causes the database transaction to become inactive before get_db()
attempts to commit, resulting in:

  sqlalchemy.exc.InvalidRequestError: This transaction is inactive

Add db.is_active checks before commit() and rollback() to handle cases
where the transaction becomes inactive during async context manager cleanup.

Closes #2341

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Copy link
Copy Markdown
Member Author

crivetimihai commented Jan 23, 2026

Root Cause Analysis

Timeline from Logs

Timestamp Event
09:42:54.938Z Fetched 10 tools from gateway
09:42:55.506Z Fetched 53 resources from gateway
09:42:56.024Z Fetched 1 resource templates from gateway
09:42:56.378Z DELETE https://screening-ai-services.npe.moodys.cloud/mcp "204 No Content"
09:42:56.416Z InvalidRequestError('This transaction is inactive')
09:42:56.418Z POST /gateways HTTP/1.1" 500

The Trigger

The logs show a CancelledError during MCP session cleanup:

receive_response_body.failed exception=CancelledError("Cancelled via cancel scope ... by
<Task pending name='starlette.middleware.base.BaseHTTPMiddleware.__call__.<locals>.call_next.<locals>.coro' ...")

Sequence of Events

  1. Request startsget_db() creates session, yields it
  2. Gateway registration → Tools/resources fetched successfully
  3. MCP session cleanup → DELETE request sent
  4. CancelledError → During response body reception
  5. Exception propagates → Transaction implicitly rolled back
  6. FastAPI cleanupget_db() resumes after yield
  7. Commit attempted → Transaction already inactive → CRASH

Moody's Server Details

  • Tools: 10 (getLegalDisclaimers, getAuthStatus, screenAndWait, etc.)
  • Resources: 53
  • Resource Templates: 1
  • Protocol Version: 2025-11-25

…straint violations

When re-registering a gateway (e.g., after deletion or crash), orphaned resources
and prompts from previous registrations could cause unique constraint violations
on `(team_id, owner_email, uri)` for resources and `(team_id, owner_email, name)`
for prompts.

This fix adds upsert logic that:
1. Queries for existing resources/prompts matching the unique constraint
2. Updates existing records instead of creating duplicates
3. Creates new records only when no match exists

This handles scenarios like:
- Gateway deletion that didn't properly clean up resources (issue #2341)
- Re-registration of the same MCP server under a new gateway name
- Race conditions during concurrent operations

Closes #2352

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Adds a utility script to identify and remove database records that were
left orphaned due to incomplete gateway deletions (e.g., #2341 crash).

Usage:
  # Dry run (default) - shows what would be deleted
  python scripts/cleanup_orphaned_resources.py

  # Actually delete orphaned records
  python scripts/cleanup_orphaned_resources.py --execute

  # Filter by team or owner
  python scripts/cleanup_orphaned_resources.py --team-id <id> --owner-email <email>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Addresses code review findings:

1. HIGH: Only update truly orphaned records (gateway_id IS NULL or points
   to non-existent gateway). Resources belonging to active gateways are
   no longer at risk of being reassigned.

2. MEDIUM: Use per-resource team/owner overrides when building lookup key,
   matching exactly what would be inserted to avoid constraint mismatches.

3. LOW: Added tests for orphaned resource upsert logic:
   - test_register_gateway_updates_orphaned_resources
   - test_register_gateway_does_not_update_active_gateway_resources

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Addresses remaining code review findings:

1. ROLLBACK GUARD FIX:
   - REMOVED the `if db.is_active:` check before `db.rollback()`
   - Empirical testing proved that:
     * After IntegrityError, is_active becomes False
     * rollback() when is_active=False SUCCEEDS (doesn't fail!)
     * rollback() restores is_active to True, cleaning up the session
     * Skipping rollback when is_active=False leaves session unusable
   - The is_active guard for commit() is CORRECT (commit fails when False)
   - The is_active guard for rollback() was WRONG (rollback is always safe)

2. TEST ASSERTIONS:
   - Rewrote orphaned resource tests with proper assertions
   - Tests now directly verify:
     * Orphaned resources are detected and added to map
     * Resource fields are actually updated during upsert
     * Resources with deleted gateways are detected as orphaned
     * Resources with active gateways are NOT touched
     * Per-resource owner/team overrides are used in lookup key

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai merged commit 5f14122 into main Jan 23, 2026
53 checks passed
@crivetimihai crivetimihai deleted the 2341-mcp-issue branch January 23, 2026 17:06
kcostell06 pushed a commit to kcostell06/mcp-context-forge that referenced this pull request Feb 24, 2026
…#2351)

* fix(db): Guard against inactive transaction during async cleanup

When registering MCP servers with long initialization times (like Moody's),
a CancelledError can occur during the MCP session teardown (DELETE request).
This causes the database transaction to become inactive before get_db()
attempts to commit, resulting in:

  sqlalchemy.exc.InvalidRequestError: This transaction is inactive

Add db.is_active checks before commit() and rollback() to handle cases
where the transaction becomes inactive during async context manager cleanup.

Closes IBM#2341

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Add upsert logic for resources and prompts to prevent unique constraint violations

When re-registering a gateway (e.g., after deletion or crash), orphaned resources
and prompts from previous registrations could cause unique constraint violations
on `(team_id, owner_email, uri)` for resources and `(team_id, owner_email, name)`
for prompts.

This fix adds upsert logic that:
1. Queries for existing resources/prompts matching the unique constraint
2. Updates existing records instead of creating duplicates
3. Creates new records only when no match exists

This handles scenarios like:
- Gateway deletion that didn't properly clean up resources (issue IBM#2341)
- Re-registration of the same MCP server under a new gateway name
- Race conditions during concurrent operations

Closes IBM#2352

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add cleanup script for orphaned resources/prompts/tools

Adds a utility script to identify and remove database records that were
left orphaned due to incomplete gateway deletions (e.g., IBM#2341 crash).

Usage:
  # Dry run (default) - shows what would be deleted
  python scripts/cleanup_orphaned_resources.py

  # Actually delete orphaned records
  python scripts/cleanup_orphaned_resources.py --execute

  # Filter by team or owner
  python scripts/cleanup_orphaned_resources.py --team-id <id> --owner-email <email>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Only upsert orphaned resources/prompts, add tests

Addresses code review findings:

1. HIGH: Only update truly orphaned records (gateway_id IS NULL or points
   to non-existent gateway). Resources belonging to active gateways are
   no longer at risk of being reassigned.

2. MEDIUM: Use per-resource team/owner overrides when building lookup key,
   matching exactly what would be inserted to avoid constraint mismatches.

3. LOW: Added tests for orphaned resource upsert logic:
   - test_register_gateway_updates_orphaned_resources
   - test_register_gateway_does_not_update_active_gateway_resources

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix: Always call rollback() in exception handler, improve tests

Addresses remaining code review findings:

1. ROLLBACK GUARD FIX:
   - REMOVED the `if db.is_active:` check before `db.rollback()`
   - Empirical testing proved that:
     * After IntegrityError, is_active becomes False
     * rollback() when is_active=False SUCCEEDS (doesn't fail!)
     * rollback() restores is_active to True, cleaning up the session
     * Skipping rollback when is_active=False leaves session unusable
   - The is_active guard for commit() is CORRECT (commit fails when False)
   - The is_active guard for rollback() was WRONG (rollback is always safe)

2. TEST ASSERTIONS:
   - Rewrote orphaned resource tests with proper assertions
   - Tests now directly verify:
     * Orphaned resources are detected and added to map
     * Resource fields are actually updated during upsert
     * Resources with deleted gateways are detected as orphaned
     * Resources with active gateways are NOT touched
     * Per-resource owner/team overrides are used in lookup key

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* chore: Add file encoding header to cleanup script

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Multiple gateway import failing with unique constraint error for resources [BUG]: MCP CF crashes while listing tools from moody's mcp server

1 participant